Abstract

We consider Conditional Random Fields (CRFs) with pattern-based potentials defined on a chain. In this model the energy of a string (labeling) $x_1 \ldots x_n$ is the sum of terms over intervals $[i,j]$ where each term is non-zero only if the substring $x_i \ldots x_j$ equals a prespecified pattern $\alpha$. Such CRFs can be naturally applied to many sequence tagging problems.

We present efficient algorithms for the three standard inference tasks in a CRF, namely computing (i) the partition function, (ii) marginals, and (iii) the MAP. Their complexities are respectively $O(nL)$, $O(nL\ell_{\max})$ and $O(nL\min\{|D|, \log(\ell_{\max}+1)\})$ where $L$ is the combined length of input patterns, $\ell_{\max}$ is the maximum length of a pattern, and $D$ is the input alphabet. This improves on the previous algorithms of (Ye et al., 2009) whose complexities are respectively $O(nL|D|)$, $O(n|\Gamma|L^2\ell_{\max}^2)$ and $O(nL|D|)$, where $|\Gamma|$ is the number of input patterns.

In addition, we give an efficient algorithm for sampling. Finally, we consider the case of non-positive weights. (Komodakis & Paragios, 2009) gave an $O(nL)$ algorithm for computing the MAP. We present a modification that has the same worst-case complexity but can beat it in the best case.

1. Introduction 

This paper addresses the sequence labeling (or sequence tagging) problem: given an observation $z$ (which is usually a sequence of $n$ values), infer a labeling $x = x_1 \ldots x_n$ where each variable $x_i$ takes values in some finite domain $D$. Such problems appear in many domains such as text and speech analysis, signal analysis, and bioinformatics.

One of the most successful approaches for tackling the problem is the Hidden Markov Model (HMM). The $k$-th order HMM is given by the probability distribution $p(x|z) = \frac{1}{Z}\exp\{-E(x|z)\}$ with the energy function

$$E(x|z) = \sum_{i\in[1,n]} \psi_i(x_i, z_i) + \sum_{(i,j)\in\mathcal{E}_k} \psi_{ij}(x_{i:j}) \tag{1}$$

where $\mathcal{E}_k = \{(i, i+k) \mid i \in [1, n-k]\}$ and $x_{i:j} = x_i \ldots x_j$ is the substring of $x$ from $i$ to $j$. A popular generalization is the Conditional Random Field model (Lafferty et al., 2001) that allows all terms to depend on the full observation $z$:

$$E(x|z) = \sum_{i\in[1,n]} \psi_i(x_i, z) + \sum_{(i,j)\in\mathcal{E}_k} \psi_{ij}(x_{i:j}, z) \tag{2}$$

We study a particular variant of this model called a pattern-based CRF. It is defined via

$$E(x|z) = \sum_{\alpha\in\Gamma}\;\sum_{[i,j]\subseteq[1,n],\; j-i+1=|\alpha|} \psi^{\alpha}_{ij}(z)\cdot[x_{i:j}=\alpha] \tag{3}$$

where $\Gamma$ is a fixed set of non-empty words, $|\alpha|$ is the length of word $\alpha$ and $[\cdot]$ is the Iverson bracket. If we take $\Gamma = D \cup D^{k+1}$ then (3) becomes equivalent to (2); thus we do not lose generality (but gain more flexibility).
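As a small illustration of the energy (3), the following sketch evaluates it directly from the definition. It assumes, for simplicity, that the weights do not depend on the position or on the observation, so that a plain dictionary `psi` (a hypothetical name) maps each word of $\Gamma$ to its weight; in the general model each weight $\psi^{\alpha}_{ij}(z)$ may vary with $i$, $j$ and $z$.

```python
def pattern_energy(x, psi):
    """Energy (3) of labeling x, assuming position-independent weights.

    x   : labeling as a string, one character per variable
    psi : dict mapping each word of Gamma to its weight (illustrative assumption)
    """
    total = 0.0
    for word, weight in psi.items():
        # Slide the word over every interval [i, j] with j - i + 1 = |word|.
        for i in range(len(x) - len(word) + 1):
            if x[i:i + len(word)] == word:  # Iverson bracket [x_{i:j} = alpha]
                total += weight
    return total
```

For example, with weights `{"0": 0.5, "1": 1.0, "10": -2.0}` the labeling `"1010"` contains two occurrences of each word, so its energy is 1.0 + 2.0 - 4.0.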

Intuitively, pattern-based CRFs allow one to model long-range interactions for selected subsequences of labels. This could be useful for a variety of applications: in part-of-speech tagging, patterns could correspond to certain syntactic constructions or stable idioms; in protein secondary structure prediction, to sequences of dihedral angles corresponding to stable configurations such as α-helices; in gene prediction, to sequences of nucleotides with supposed functional roles such as "exon" or "intron", specific codons, etc.

Inference. This paper focuses on inference algorithms for pattern-based CRFs. The three standard inference tasks are (i) computing the partition function $Z = \sum_x \exp\{-E(x|z)\}$; (ii) computing marginal probabilities $p(x_{i:j} = \alpha \mid z)$ for all triplets $(i, j, \alpha)$ present in (3); (iii) computing MAP, i.e. minimizing energy (3). The complexity of solving these tasks is discussed below. We denote by $L = \sum_{\alpha\in\Gamma} |\alpha|$ the total length of patterns and by $\ell_{\max} = \max_{\alpha\in\Gamma} |\alpha|$ the maximum length of a pattern.

A naive approach is to use standard message passing techniques for an HMM of order $k = \ell_{\max} - 1$. However, they would take $O(n|D|^{k+1})$ time, which would become impractical for large $k$. More efficient algorithms with complexities $O(nL|D|)$, $O(n|\Gamma|L^2\ell_{\max}^2)$ and $O(nL|D|)$ respectively were given by (Ye et al., 2009).¹ Our first contribution is to improve this to $O(nL)$, $O(nL\ell_{\max})$ and $O(nL\cdot\min\{|D|, \log(\ell_{\max}+1)\})$ respectively (more accurate estimates are given in the next section).

We also give an algorithm for sampling from the distribution $p(x|z)$. Its complexity is either (i) $O(nL)$ per sample, or (ii) $O(n)$ per sample with an $O(nL|D|)$ preprocessing (assuming that we have an oracle that produces independent samples from the uniform distribution on $[0, 1]$ in $O(1)$ time).

Finally, consider the case when all costs $\psi^{\alpha}_{ij}(z)$ are non-positive. (Komodakis & Paragios, 2009) gave an $O(nL)$ technique for minimizing energy (3) in this case. We present a modification that has the same worst-case complexity but can beat the algorithm of (Komodakis & Paragios, 2009) in the best case.

Related work The works of (Ye et al., 2009) and 
(Komodakis & Paragios, 2009) mentioned above are 
probably the most related to our paper. The former 
applied pattern-based CRFs to the handwritten char- 
acter recognition problem and to the problem of iden- 
tification of named entities from texts. The latter con- 
sidered a pattern-based CRF on a grid for a computer 
vision application; the MAP inference problem in (Ko- 
modakis & Paragios, 2009) was converted to sequence 
labeling problems by decomposing the grid into thin 
"stripes" . 

(Qian et al., 2009) considered a more general formulation in which a single pattern is characterized by a set of strings rather than a single string $\alpha$. They proposed an exact inference algorithm and applied it to the OCR task and to the Chinese Organization Name Recognition task. However, their algorithm could take time exponential in the total lengths of input patterns; no subclasses of inputs were identified which could be solved in polynomial time.

¹ Some of the bounds stated in (Ye et al., 2009) are actually weaker. However, it is not difficult to show that their algorithms can be implemented in the times stated above, using our Lemma 1.

A different generalization (for non-sequence data) was proposed in (Rother et al., 2009). Their inference procedure reduces the problem to MAP estimation in a pairwise CRF with cycles, which is then solved with approximate techniques such as BP, TRW or QPBO. This model was applied to the texture restoration problem.

(Nguyen et al., 2011) extended algorithms in (Ye et al., 
2009) to the Semi-Markov model (Sarawagi & Cohen, 
2004). We conjecture that our algorithms can be ex- 
tended to this case as well, and can yield a better com- 
plexity compared to (Nguyen et al., 2011). 

2. Notation and preliminaries 

First, we introduce a few definitions. 

• A pattern is a pair $\alpha = ([i,j], x)$ where $[i,j]$ is an interval in $[1, n]$ and $x = x_i \ldots x_j$ is a sequence over alphabet $D$ indexed by integers in $[i,j]$ ($j \ge i - 1$). The length of $\alpha$ is denoted as $|\alpha| = |x| = j - i + 1$.

• Symbol "*" denotes an arbitrary word or pattern (possibly the empty word $\varepsilon$ or the empty pattern $\varepsilon_s = ([s+1, s], \varepsilon)$ at position $s$). The exact meaning will always be clear from the context. Similarly, "+" denotes an arbitrary non-empty word or pattern.

• The concatenation of patterns $\alpha = ([i,j], x)$ and $\beta = ([j+1,k], y)$ is the pattern $\alpha\beta = ([i,k], xy)$. Whenever we write $\alpha\beta$ we assume that it is defined, i.e. $\alpha = ([\cdot, j], \cdot)$ and $\beta = ([j+1, \cdot], \cdot)$ for some $j$.

• For a pattern $\alpha = ([i,j], x)$ and interval $[k,\ell] \subseteq [i,j]$, the subpattern of $\alpha$ at position $[k,\ell]$ is the pattern $\alpha_{k:\ell} = ([k,\ell], x_{k:\ell})$ where $x_{k:\ell} = x_k \ldots x_\ell$. If $k = i$ then $\alpha_{k:\ell}$ is called a prefix of $\alpha$. If $\ell = j$ then $\alpha_{k:\ell}$ is a suffix of $\alpha$.

• If $\beta$ is a subpattern of $\alpha$, i.e. $\beta = \alpha_{k:\ell}$ for some $[k,\ell]$, then we say that $\beta$ is contained in $\alpha$. This is equivalent to the condition $\alpha = *\beta*$.

• $D^{i:j} = \{([i,j], x) \mid x \in D^{[i,j]}\}$ is the set of patterns with interval $[i,j]$. We typically use letter $x$ for patterns in $D^{1:s}$ and letters $\alpha, \beta, \ldots$ for other patterns. Patterns $x \in D^{1:s}$ will be called partial labelings.

• For a set of patterns $\Pi$ and index $s \in [0, n]$ we denote by $\Pi_s$ the set of patterns in $\Pi$ that end at position $s$: $\Pi_s = \{([i,s], \alpha) \in \Pi\}$.

• For a pattern $\alpha$ let $\alpha^-$ be the prefix of $\alpha$ of length $|\alpha| - 1$; if $\alpha$ is empty then $\alpha^-$ is undefined.
We will consider the following general problem. Let $\Pi^\circ$ be the set of patterns of words in $\Gamma$ placed at all possible positions: $\Pi^\circ = \{([i,j], \alpha) \mid \alpha \in \Gamma\}$. Let $(R, \oplus, \otimes)$ be a commutative semiring with elements $\mathbf{0}, \mathbf{1} \in R$ which are identities for $\oplus$ and $\otimes$ respectively. Define the cost of pattern $x \in D^{i:j}$ via

$$f(x) = \bigotimes_{\alpha\in\Pi^\circ,\; x = *\alpha*} c_\alpha \tag{4}$$

where $c_\alpha \in R$ are fixed constants. (Throughout the paper we adopt the convention that operations $\oplus$ and $\otimes$ over the empty set of arguments give respectively $\mathbf{0}$ and $\mathbf{1}$, and so e.g. $f(\varepsilon_s) = \mathbf{1}$.) Our goal is to compute

$$Z = \bigoplus_{x\in D^{1:n}} f(x) \tag{5}$$

Example 1. If $(R, \oplus, \otimes) = (\mathbb{R}, +, \times)$ then problem (5) is equivalent to computing the partition function for the energy (3), if we set $c_{([i,j],\alpha)} = \exp\{-\psi^{\alpha}_{ij}(z)\}$.

Example 2. If $(R, \oplus, \otimes) = (\overline{\mathbb{R}}, \min, +)$ where $\overline{\mathbb{R}} = \mathbb{R} \cup \{+\infty\}$ then we get the problem of minimizing energy (3), if $c_{([i,j],\alpha)} = \psi^{\alpha}_{ij}(z)$.
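To make the semiring formulation concrete, here is a brute-force sketch of problem (5) that is parameterized by the semiring. It is only a reference for very small inputs (it enumerates all $|D|^n$ labelings), and it assumes position-independent costs stored in a dictionary `c`; the tuple layout `(zero, one, plus, times)` and the function names are illustrative choices, not part of the paper.

```python
from itertools import product
import math

# (zero, one, plus, times) for the semirings of Examples 1 and 2
SUM_PRODUCT = (0.0, 1.0, lambda a, b: a + b, lambda a, b: a * b)
MIN_PLUS = (math.inf, 0.0, min, lambda a, b: a + b)

def cost(x, c, semiring):
    """f(x) from eq. (4): the 'product' of c[word] over all occurrences in x."""
    _, one, _, times = semiring
    value = one
    for word, c_word in c.items():
        for i in range(len(x) - len(word) + 1):
            if x[i:i + len(word)] == word:
                value = times(value, c_word)
    return value

def brute_force_Z(n, alphabet, c, semiring):
    """Z from eq. (5): the 'sum' of f(x) over all labelings of length n."""
    zero, _, plus, _ = semiring
    total = zero
    for letters in product(alphabet, repeat=n):
        total = plus(total, cost("".join(letters), c, semiring))
    return total
```

With `MIN_PLUS` and `c = psi` this returns the minimum energy; with `SUM_PRODUCT` and `c[word] = exp(-psi[word])` it returns the partition function.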

The complexity of our algorithms will be stated in terms of the following quantities:

• $P = |\{\alpha \mid \exists \alpha* \in \Gamma, \alpha \ne \varepsilon\}|$ is the number of distinct non-empty prefixes of words in $\Gamma$. Note that $P \le L$.

• $P' = |\{\alpha \mid \exists \alpha+ \in \Gamma\}|$ is the number of distinct proper prefixes of words in $\Gamma$. There holds $P/P' \in [1, |D|]$. If $\Gamma = D^1 \cup D^2 \cup \ldots \cup D^k$ then $P/P' = |D|$. If $\Gamma$ is a sparse random subset of the set above then $P/P' \approx 1$.

• $I(\Gamma) = \{\alpha \mid \exists \alpha*, *\alpha \in \Gamma, \alpha \ne \varepsilon\}$ is the set of non-empty words which are both prefixes and suffixes of some words in $\Gamma$. Note that $\Gamma \subseteq I(\Gamma)$ and $|I(\Gamma)| \le P$.
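The three quantities above are easy to compute for a concrete word set, which can help when estimating running times. The sketch below does this directly from the definitions; the function name `prefix_stats` is an illustrative choice.

```python
def prefix_stats(gamma):
    """Return (P, P', I(Gamma)) for a collection of non-empty words."""
    # Distinct non-empty prefixes (a word counts as a prefix of itself).
    nonempty_prefixes = {w[:k] for w in gamma for k in range(1, len(w) + 1)}
    # Distinct proper prefixes (the empty word is included, the full word is not).
    proper_prefixes = {w[:k] for w in gamma for k in range(len(w))}
    # Distinct non-empty suffixes.
    suffixes = {w[k:] for w in gamma for k in range(len(w))}
    return len(nonempty_prefixes), len(proper_prefixes), nonempty_prefixes & suffixes
```

For the word set $\{0, 1, 1000, 1010\}$ used in Fig. 1 this gives $P = 7$, $P' = 5$ and $I(\Gamma) = \{0, 1, 10, 1000, 1010\}$, consistent with $P \le L = 10$ and $|I(\Gamma)| \le P$.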

We will present 6 algorithms:

Sec. 3: $\Theta(nP)$ algorithm for the case when $(R, \oplus, \otimes)$ is a ring, i.e. it has an operation $\ominus$ that satisfies $(a \oplus b) \ominus b = a$ for all $a, b \in R$. This holds for the semiring in Example 1 (but not for Example 2).

Sec. 4: $\Theta(nP)$ algorithm for sampling. Alternatively, it can be implemented to produce independent samples in $O(n)$ time per sample with a $\Theta(nP|D|)$ preprocessing.

Sec. 5: $O(n\sum_{\alpha\in I(\Gamma)} |\alpha|)$ algorithm for computing marginals for all patterns $\alpha \in \Pi^\circ$.

Sec. 6: $\Theta(nP'|D|)$ algorithm for a general commutative semiring, which is equivalent to the algorithm in (Ye et al., 2009). It will be used as a motivation for the next algorithm.

Sec. 7: $O(nP\log P)$ algorithm for a general commutative semiring; for the semiring in Example 2 the complexity can be reduced to $O(nP\log(\ell_{\max}+1))$.

Sec. 8: $O(nP)$ algorithm for the case $(R, \oplus, \otimes) = (\overline{\mathbb{R}}, \min, +)$, $c_\alpha \le 0$ for all $\alpha \in \Pi^\circ$.

Figure 1. Graph $G[\Pi_s]$ for the set of 8 patterns shown on the left (for brevity, their intervals are not shown; they all end at the same position $s$). This set of patterns would arise if $\Gamma = \{0, 1, 1000, 1010\}$ and $\Pi$ was defined as the set of all prefixes of patterns in $\Pi^\circ$.

All algorithms will have the following structure. Given the set of input patterns $\Pi^\circ$, we first construct another set of patterns $\Pi$; it will typically be either the set of prefixes or the set of proper prefixes of patterns in $\Pi^\circ$. This can be done in a preprocessing step since sets $\Pi_s$ will be isomorphic (up to a shift) for indexes $s$ that are sufficiently far from the boundary. (Recall that $\Pi_s$ is the set of patterns in $\Pi$ that end at position $s$.) Then we recursively compute messages $M_s(\alpha)$ for $\alpha \in \Pi_s$ which have the following interpretation: $M_s(\alpha)$ is the sum ("$\oplus$") of costs $f(x)$ over a certain set of partial labelings of the form $x = *\alpha \in D^{1:s}$. In some of the algorithms we also compute messages $W_s(\alpha)$, which is the sum of $f(x)$ over all partial labelings of the form $x = *\alpha \in D^{1:s}$.

Graph $G[\Pi_s]$. The following construction will be used throughout the paper. Given a set of patterns $\Pi$ and index $s$, we define $G[\Pi_s] = (\Pi_s, E[\Pi_s])$ to be a directed graph with the following set of edges: $(\alpha, \beta)$ belongs to $E[\Pi_s]$ for $\alpha, \beta \in \Pi_s$ if $\alpha$ is a proper suffix of $\beta$ ($\beta = +\alpha$) and $\Pi_s$ does not have an "intermediate" suffix $\gamma$ of $\beta$ with $|\beta| > |\gamma| > |\alpha|$. It can be checked that graph $G[\Pi_s]$ is a directed forest. If $\varepsilon_s \in \Pi_s$ then $G[\Pi_s]$ is connected and therefore is a tree. In this case we treat $\varepsilon_s$ as the root. An example is shown in Fig. 1.
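The construction of $G[\Pi_s]$ can be sketched in a few lines. The code below represents patterns ending at a common position $s$ simply by their words (the interval is implicit), and stores the forest as a child-to-parent map; `suffix_forest` is a hypothetical helper name. It is a naive quadratic construction meant for illustration only.

```python
def suffix_forest(patterns):
    """Map each pattern to its parent in G[Pi_s]: its longest proper suffix in the set."""
    parent = {}
    for w in patterns:
        # Proper suffixes w[1:], w[2:], ..., "" are tried from longest to shortest,
        # so the first hit has no "intermediate" suffix between it and w.
        for k in range(1, len(w) + 1):
            if w[k:] in patterns:
                parent[w] = w[k:]
                break
    return parent

# The setting of Fig. 1: all prefixes of the words in Gamma (the empty word is the root).
gamma = ["0", "1", "1000", "1010"]
pi_s = {w[:k] for w in gamma for k in range(len(w) + 1)}
parent = suffix_forest(pi_s)
```

Here `pi_s` has 8 elements, and for instance the parent of `"1010"` is `"10"`, whose parent is `"0"`, whose parent is the root `""`.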

Computing partial costs. Recall that $f(\alpha)$ for a pattern $\alpha$ is the cost of all patterns inside $\alpha$ (eq. (4)). We also define $\phi(\alpha)$ to be the cost of only those patterns that are suffixes of $\alpha$:

$$\phi(\alpha) = \bigotimes_{\beta\in\Pi^\circ,\; \alpha = *\beta} c_\beta \tag{6}$$

Quantities $\phi(\alpha)$ and $f(\alpha)$ will be heavily used by the algorithms below; let us show how to compute them efficiently.

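Lemma 1. Let $\Pi$ be a set of patterns with $\varepsilon_s \in \Pi$ for all $s \in [0, n]$. Values $\phi(\alpha)$ for all $\alpha \in \Pi$ can be computed using $O(|\Pi|)$ multiplications ("$\otimes$"). The same holds for values $f(\alpha)$ assuming that $\Pi$ is prefix-closed, i.e. $\alpha^- \in \Pi$ for all non-empty patterns $\alpha \in \Pi$.

Proof. To compute $\phi(\cdot)$ for patterns $\alpha \in \Pi_s$, we use the following procedure: (i) set $\phi(\varepsilon_s) := \mathbf{1}$; (ii) traverse edges $(\alpha, \beta) \in E[\Pi_s]$ of tree $G[\Pi_s]$ (from the root to the leaves) and set

$$\phi(\beta) := \begin{cases} \phi(\alpha) \otimes c_\beta & \text{if } \beta \in \Pi^\circ \\ \phi(\alpha) & \text{otherwise} \end{cases}$$

Now suppose that $\Pi$ is prefix-closed. After computing $\phi(\cdot)$, we go through indexes $s \in [0, n]$ and set

$$f(\varepsilon_s) := \mathbf{1}, \qquad f(\alpha) := f(\alpha^-) \otimes \phi(\alpha) \quad \forall \alpha \in \Pi_s - \{\varepsilon_s\}$$

□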
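The procedure in the proof of Lemma 1 can be sketched as follows for the semiring $(\mathbb{R}, +, \times)$. The sketch assumes position-independent costs (so one table serves every position $s$) and takes $\Pi$ to be the set of all prefixes of words in $\Gamma$, which is prefix-closed; `gamma_costs` and the function name are illustrative. The parent lookup is done naively here, so only the number of multiplications, not the total running time, matches the lemma.

```python
def suffix_and_total_costs(gamma_costs):
    """Compute phi(w) and f(w) for every prefix w of a word in Gamma.

    gamma_costs : dict mapping each word of Gamma to its cost c_w
                  (assumed independent of the position, for illustration).
    """
    prefixes = {w[:k] for w in gamma_costs for k in range(len(w) + 1)}
    phi, f = {"": 1.0}, {"": 1.0}
    # Increasing length guarantees that the tree parent and the prefix w[:-1]
    # have already been processed.
    for w in sorted(prefixes - {""}, key=len):
        parent = next(w[k:] for k in range(1, len(w) + 1) if w[k:] in prefixes)
        phi[w] = phi[parent] * gamma_costs.get(w, 1.0)  # multiply by c_w only if w is in Gamma
        f[w] = f[w[:-1]] * phi[w]
    return phi, f
```

For costs `{"0": 2, "1": 3, "1000": 5, "1010": 7}` this gives `phi["1010"] = 14` (suffixes `"1010"` and `"0"`) and `f["1010"] = 252` (two occurrences each of `"0"` and `"1"`, plus `"1010"`).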



Sets of partial labelings. Let $\Pi_s$ be a set of patterns that end at position $s$. Assume that $\varepsilon_s \in \Pi_s$. For a pattern $\alpha \in \Pi_s$ we define

$$X_s(\alpha) = \{x \in D^{1:s} \mid x = *\alpha\} \tag{7}$$

$$X_s(\alpha; \Pi_s) = X_s(\alpha) - \bigcup_{(\alpha,\beta)\in E[\Pi_s]} X_s(\beta) \tag{8}$$

It can be seen that sets $X_s(\alpha; \Pi_s)$ are disjoint, and their union over $\alpha \in \Pi_s$ is $D^{1:s}$. Furthermore, there holds

$$X_s(\alpha; \Pi_s) = \{x \in X_s(\alpha) \mid x \ne *\beta \;\; \forall \beta = +\alpha \in \Pi_s\} \tag{9}$$

We will use eq. (9) as the definition of $X_s(\alpha; \Pi_s)$ in the case when $\alpha \notin \Pi_s$.

3. Computing partition function 

In this section we give an algorithm for computing quantity (5) assuming that $(R, \oplus, \otimes)$ is a ring. This can be used, in particular, for computing the partition function. We will assume that $D \subseteq \Gamma$; we can always add $D$ to $\Gamma$ if needed².



² Note that we still claim complexity $O(nP)$ where $P$ is the number of distinct non-empty prefixes of words in the original set $\Gamma$. Indeed, we can assume w.l.o.g. that each letter in $D$ occurs in at least one word of $\Gamma$. (If not, then we can "merge" non-occurring letters into a single letter and add this letter to $\Gamma$; clearly, any instance over the original pair $(D, \Gamma)$ can be equivalently formulated as an instance over the new pair. The transformation increases $P$ only by 1.) The assumption implies that $|D| \le P$. Adding $D$ to $\Gamma$ increases $P$ by at most $P$, and thus does not affect the bound $O(nP)$.



First, we select set $\Pi$ as the set of prefixes of patterns in $\Pi^\circ$:

$$\Pi = \{\alpha \mid \exists \alpha* \in \Pi^\circ\} \tag{10}$$

We will compute the following quantities for each $s \in [0, n]$, $\alpha \in \Pi_s$:

$$M_s(\alpha) = \bigoplus_{x\in X_s(\alpha;\Pi_s)} f(x), \qquad W_s(\alpha) = \bigoplus_{x\in X_s(\alpha)} f(x) \tag{11}$$

It is easy to see that for $\alpha \in \Pi_s$ the following equalities relate $M_s$ and $W_s$:

$$M_s(\alpha) = W_s(\alpha) \ominus \bigoplus_{(\alpha,\beta)\in E[\Pi_s]} W_s(\beta) \tag{12a}$$

$$W_s(\alpha) = M_s(\alpha) \oplus \bigoplus_{(\alpha,\beta)\in E[\Pi_s]} W_s(\beta) \tag{12b}$$

These relations motivate the following algorithm. Since $|\Pi_s| = P + 1$ for indexes $s$ that are sufficiently far from the boundary, its complexity is $\Theta(nP)$ assuming that values $\phi(\alpha)$ in eq. (13a) are computed using Lemma 1.

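Algorithm 1 Computing $Z = \bigoplus_{x\in D^{1:n}} f(x)$ for a ring

1: initialize messages: set $W_0(\varepsilon_0) := \mathbf{1}$
2: for each $s = 1, \ldots, n$ traverse nodes $\alpha \in \Pi_s$ of tree $G[\Pi_s]$ starting from the leaves and set

$$M_s(\alpha) := \phi(\alpha) \otimes \Big[ W_{s-1}(\alpha^-) \ominus \bigoplus_{(\alpha,\beta)\in E[\Pi_s]} W_{s-1}(\beta^-) \Big] \tag{13a}$$

$$W_s(\alpha) := M_s(\alpha) \oplus \bigoplus_{(\alpha,\beta)\in E[\Pi_s]} W_s(\beta) \tag{13b}$$

Exception: if $\alpha = \varepsilon_s$ then set $M_s(\alpha) := \mathbf{0}$ instead of (13a)
3: return $Z := W_n(\varepsilon_n)$

Theorem 2. Algorithm 1 is correct, i.e. it returns the correct value of $Z = \bigoplus_x f(x)$.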
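The following is a minimal sketch of Algorithm 1 for the semiring $(\mathbb{R}, +, \times)$, checked against brute-force enumeration. It makes several simplifying assumptions that are not part of the paper: costs are position-independent (a dictionary `costs` from words to positive numbers), letters missing from $\Gamma$ are added with cost 1, and the tree and the values $\phi$ are built naively rather than in the optimized way needed for the $\Theta(nP)$ bound. All function names are illustrative.

```python
from itertools import product

def partition_function(n, alphabet, costs):
    """Sketch of Algorithm 1 over (R, +, x) with position-independent costs."""
    gamma = dict(costs)
    for a in alphabet:                       # enforce D subset of Gamma (neutral cost 1)
        gamma.setdefault(a, 1.0)
    prefixes = {w[:k] for w in gamma for k in range(len(w) + 1)}

    def parent(w):                           # longest proper suffix that is a prefix
        return next(w[k:] for k in range(1, len(w) + 1) if w[k:] in prefixes)

    children = {w: [] for w in prefixes}
    phi = {"": 1.0}
    for w in sorted(prefixes - {""}, key=len):
        p = parent(w)
        children[p].append(w)
        phi[w] = phi[p] * gamma.get(w, 1.0)  # Lemma 1

    W_prev = {"": 1.0}                       # W_0(eps_0) = 1
    for s in range(1, n + 1):
        M, W = {}, {}
        nodes = [w for w in prefixes if len(w) <= s]       # Pi_s
        for w in sorted(nodes, key=len, reverse=True):     # leaves first
            kids = [b for b in children[w] if len(b) <= s]
            if w:
                M[w] = phi[w] * (W_prev[w[:-1]] - sum(W_prev[b[:-1]] for b in kids))  # (13a)
            else:
                M[w] = 0.0                                  # exception for the root
            W[w] = M[w] + sum(W[b] for b in kids)           # (13b)
        W_prev = W
    return W_prev[""]

def brute_force(n, alphabet, costs):
    """Reference value of Z by enumerating all labelings (tiny inputs only)."""
    total = 0.0
    for letters in product(alphabet, repeat=n):
        x, value = "".join(letters), 1.0
        for w, c in costs.items():
            for i in range(n - len(w) + 1):
                if x[i:i + len(w)] == w:
                    value *= c
        total += value
    return total
```

Because the recursion uses subtraction, floating-point cancellation can occur when messages differ greatly in magnitude; a practical implementation would need to address this, which the sketch does not attempt.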

3.1. Proof of Theorem 2 

Eq. (13b) coincides with (12b); let us show that eq. (13a) holds for any $\alpha \in \Pi_s - \{\varepsilon_s\}$. (Note, for $\alpha = \varepsilon_s$ step 2 is correct: assumption $D \subseteq \Gamma$ implies that $D^{s:s} \subseteq \Pi_s$, and therefore $X_s(\varepsilon_s; \Pi_s) = \emptyset$, $M_s(\varepsilon_s) = \mathbf{0}$.)

For a partial labeling $x \in D^{1:s}$ define the "reduced partial cost" as

$$f^-(x) = \bigotimes_{\alpha\in\Pi^\circ,\; x = *\alpha+} c_\alpha \tag{14}$$

It is easy to see from (11) that for any $\alpha \in \Pi_s - \{\varepsilon_s\}$

$$W_{s-1}(\alpha^-) = \bigoplus_{x\in X_s(\alpha)} f^-(x) \tag{15}$$

Consider $\alpha \in \Pi_s - \{\varepsilon_s\}$. We will show that for any $x \in X_s(\alpha)$ there holds

$$[x \in X_s(\alpha; \Pi_s)] \otimes f(x) = \phi(\alpha) \otimes \Big[ f^-(x) \ominus \bigoplus_{(\alpha,\beta)\in E[\Pi_s],\; x\in X_s(\beta)} f^-(x) \Big] \tag{16}$$

where $[\cdot] = \mathbf{1}$ if the argument is true, and $\mathbf{0}$ otherwise. This will be sufficient for establishing the theorem: summing these equations over $x \in X_s(\alpha)$ and using (11), (15) yields eq. (13a).

Two cases are possible:

Case 1: $x \in X_s(\beta)$ for some $(\alpha, \beta) \in E[\Pi_s]$. (Such $\beta$ is unique since sets $X_s(\beta)$ are disjoint.) Then both sides of (16) are $\mathbf{0}$.

Case 2: $x \in X_s(\alpha; \Pi_s)$. Then eq. (16) is equivalent to $f(x) = \phi(\alpha) \otimes f^-(x)$. This holds since there is no pattern $\gamma \in \Pi^\circ$ with $x = *\gamma$ and $|\gamma| > |\alpha|$ (otherwise we would have $\gamma \in \Pi$ and thus $x \notin X_s(\alpha; \Pi_s)$ by definition (9), a contradiction).

4. Sampling 

In this section we consider the semiring $(R, \oplus, \otimes) = (\mathbb{R}, +, \times)$ from Example 1. We assume that all costs $c_\alpha$ are strictly positive. We present an algorithm for sampling labelings $x \in D^{1:n}$ according to the probability distribution $p(x) = f(x)/Z$.

As in the previous section, we assume that $D \subseteq \Gamma$, and define $\Pi$ to be the set of prefixes of patterns in $\Pi^\circ$ (eq. (10)). For a node $\alpha \in \Pi_s$ let $T_s(\alpha)$ be the set of nodes in the subtree of $G[\Pi_s]$ rooted at $\alpha$, with $\alpha \in T_s(\alpha) \subseteq \Pi_s$. For a pattern $\alpha \in \Pi_{s+1} - \{\varepsilon_{s+1}\}$ we define the set

$$A_s(\alpha) = T_s(\alpha^-) - \bigcup_{(\alpha,\beta)\in E[\Pi_{s+1}]} T_s(\beta^-) \tag{17}$$
We can now present the algorithm. 

Algorithm 2 Sample $x \sim p(x) = f(x)/Z$

1: run Algorithm 1 to compute messages $M_s(\alpha)$ for all patterns $\alpha = ([\cdot, s], \cdot) \in \Pi$
2: sample $\alpha_n \in \Pi_n$ with probability $p(\alpha_n) \propto M_n(\alpha_n)$
3: for $s = n-1, \ldots, 1$ sample $\alpha_s \in A_s(\alpha_{s+1})$ with probability $p(\alpha_s) \propto M_s(\alpha_s)$
4: return labeling $x$ with $x_{s:s} = (\alpha_s)_{s:s}$ for $s \in [1, n]$



We say that step $s$ of the algorithm is valid if either (i) $s = n$, or (ii) $s \in [1, n-1]$, step $s+1$ is valid, $\alpha_{s+1} \ne \varepsilon_{s+1}$ and $M_s(\alpha) > 0$ for some $\alpha \in A_s(\alpha_{s+1})$. (This is a recursive definition.) Clearly, if step $s$ is valid then line 3 of the algorithm is well-defined.

Theorem 3. (a) With probability 1 all steps of the algorithm are valid. (b) The returned labeling $x \in D^{1:n}$ is distributed according to $p(x) = f(x)/Z$.

Complexity. Assume that we have an oracle that produces independent samples from the uniform distribution on $[0, 1]$ in $O(1)$ time.

The main subroutine performed by the algorithm is sampling from a given discrete distribution. Clearly, this can be done in $O(N)$ time where $N$ is the number of allowed values of the random variable. With a $\Theta(N)$ preprocessing, a sample can also be produced in $O(1)$ time by the so-called "alias method" (Vose, 1991).

This leads to two possible complexities: (i) $O(nP)$ per sample (without preprocessing); (ii) $O(n)$ per sample (with preprocessing). Let us discuss the complexity of this preprocessing. Running Algorithm 1 takes $\Theta(nP)$ time. After that, for each $\alpha \in \Pi_{s+1}$ we need to run the linear time procedure of (Vose, 1991) for the distribution $p(\beta) \propto M_s(\beta)$, $\beta \in A_s(\alpha)$. The following theorem implies that this takes $\Theta(nP|D|)$ time.
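For readers unfamiliar with the alias method mentioned above, here is a compact sketch of the standard construction: linear-time table building, then constant-time draws. It is a generic illustration rather than code from the paper; the function names are hypothetical, and `weights` stands for the unnormalized values $M_s(\beta)$ of one of the distributions.

```python
import random

def build_alias(weights):
    """Build alias tables for sampling index i with probability weights[i] / sum(weights)."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]   # average value is exactly 1
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l        # column s: keep s w.p. scaled[s], else l
        scaled[l] -= 1.0 - scaled[s]            # l donated the rest of column s
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                     # leftovers are 1 up to rounding
        prob[i], alias[i] = 1.0, i
    return prob, alias

def draw(prob, alias, rng=random):
    """Draw one index in O(1): pick a column uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

In Algorithm 2, one such table would be prepared for every pair $(s, \alpha_{s+1})$, which is where the $\Theta(nP|D|)$ preprocessing cost comes from.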

Theorem 4. There holds $\sum_{\alpha\in\Pi_s - \{\varepsilon_s\}} |A_{s-1}(\alpha)| = |\Pi_{s-1}| \cdot |D|$.

Proof. Consider pattern $\beta \in \Pi_{s-1}$. For a letter $a \in D^{s:s}$ let $\beta^a \in \Pi_s$ be the longest suffix $\alpha$ of $\beta a$ with $\alpha \in \Pi_s$ (at least one such suffix exists, namely $a$). It can be seen that the set $\{\beta^a \mid a \in D^{s:s}\}$ is exactly the set of patterns $\alpha \in \Pi_s - \{\varepsilon_s\}$ for which $A_{s-1}(\alpha)$ contains $\beta$ (checking this fact is just definition chasing). Therefore, the sum in the theorem equals $\sum_{\beta\in\Pi_{s-1}} |\{\beta^a \mid a \in D^{s:s}\}| = |\Pi_{s-1}| \cdot |D|$. □

To summarize, we showed that with a $\Theta(nP|D|)$ preprocessing we can compute independent samples from $p(x)$ in $O(n)$ time per sample.

4.1. Proof of Theorem 3 

Suppose that step $s \in [1, n]$ of the algorithm is valid; this means that patterns $\alpha_t$ for $t \in [s, n]$ are well-defined. For $t \in [s, n]$ we then define the set of patterns $A_t = A_t(\alpha_{t+1}) \subseteq \Pi_t$ (if $t = n$ then we define $A_t = \Pi_t$ instead). We also define sets of labelings

$$\mathcal{Y}_t(\alpha) = \{y\,x_{t+1:n} \mid y \in X_t(\alpha; \Pi_t)\} \quad \forall \alpha \in A_t \tag{18}$$

$$\mathcal{Y}_t = \mathcal{Y}_t(\alpha_t) \tag{19}$$

where $x$ is a labeling with $x_{t:t} = (\alpha_t)_{t:t}$ for $t \in [s, n]$. Let $\mathcal{Y}_{n+1} = D^{1:n}$.
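Lemma 5. Suppose that step $s \in [1, n]$ is valid.

(a) $\mathcal{Y}_{s+1}$ is a disjoint union of sets $\mathcal{Y}_s(\alpha)$ over $\alpha \in A_s$.

(b) For each $y \in \mathcal{Y}_{s+1} = \bigsqcup_{\alpha\in A_s} \mathcal{Y}_s(\alpha)$ there holds $f(y) = \mathrm{const}_s \cdot f(y_{1:s})$, and consequently for any $\alpha \in A_s$ there holds

$$\sum_{y\in\mathcal{Y}_s(\alpha)} f(y) = \mathrm{const}_s \cdot \sum_{y\in X_s(\alpha;\Pi_s)} f(y) = \mathrm{const}_s \cdot M_s(\alpha)$$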



Theorem 3 will follow from this lemma. Indeed, the lemma shows that the algorithm implicitly computes a sequence of nested sets $D^{1:n} = \mathcal{Y}_{n+1} \supseteq \mathcal{Y}_n \supseteq \ldots \supseteq \mathcal{Y}_1 = \{x\}$. At step $s$ we divide set $\mathcal{Y}_{s+1}$ into disjoint subsets $\mathcal{Y}_s(\alpha)$, $\alpha \in A_s$ and select one of them, $\mathcal{Y}_s = \mathcal{Y}_s(\alpha_s)$, with probability proportional to $M_s(\alpha_s) \propto \sum_{y\in\mathcal{Y}_s} f(y)$.

We still need to show that if step $s \in [2, n]$ is valid then step $s - 1$ is valid as well with probability 1. It follows from the precondition that $\alpha_s$ sampled in line 3 satisfies $M_s(\alpha_s) > 0$ with probability 1; this implies that $\alpha_s \ne \varepsilon_s$. From the paragraph above we get that $\sum_{y\in\mathcal{Y}_s} f(y) > 0$ with probability 1. We also have $\sum_{\alpha\in A_{s-1}} M_{s-1}(\alpha) \propto \sum_{\alpha\in A_{s-1}} \sum_{y\in\mathcal{Y}_{s-1}(\alpha)} f(y) = \sum_{y\in\mathcal{Y}_s} f(y) > 0$, implying that $M_{s-1}(\alpha) > 0$ for some $\alpha \in A_{s-1}$. This concludes the proof that step $s - 1$ is valid with probability 1.

It remains to prove Lemma 5. 

Part (a). First, we need to check that $X_s(\alpha^-_{s+1}; \Pi^-_{s+1})$ is equal to the disjoint union of $X_s(\alpha; \Pi_s)$ over $\alpha \in A_s(\alpha_{s+1})$, where $\Pi^-_{s+1} = \{\alpha^- \mid \alpha \in \Pi_{s+1}\}$. Disjointness of $X_s(\alpha; \Pi_s)$ for different $\alpha \in \Pi_s$ is obvious. Since $\Pi^-_{s+1} \subseteq \Pi_s$, for any $\alpha \in A_s(\alpha_{s+1})$ the inclusion $X_s(\alpha; \Pi_s) \subseteq X_s(\alpha^-_{s+1}; \Pi^-_{s+1})$ is straightforward from the definition of $A_s(\alpha_{s+1})$. Thus, we only need to check the inclusion of $X_s(\alpha^-_{s+1}; \Pi^-_{s+1})$ in the union.

Elements of $\Pi^-_{s+1} \subseteq \Pi_s$ can be seen as nodes in tree $G[\Pi_s]$. Then any pattern $x$ from $X_s(\alpha^-_{s+1}; \Pi^-_{s+1})$ defines the longest suffix $s(x)$ such that $s(x) \in \Pi_s$. It is easy to see that $s(x) \in T_s(\alpha^-_{s+1})$, and moreover, the descending path in $G[\Pi_s]$ from $\alpha^-_{s+1}$ to $s(x)$ does not contain elements from $\Pi^-_{s+1} - \{\alpha^-_{s+1}\}$; otherwise $x, s(x) \notin X_s(\alpha^-_{s+1}; \Pi^-_{s+1})$. It is easy to see that this is equivalent to $s(x) \in A_s(\alpha_{s+1})$. Since $x \in X_s(s(x); \Pi_s)$, $X_s(\alpha^-_{s+1}; \Pi^-_{s+1})$ is a subset of the union of $X_s(\alpha; \Pi_s)$ over $\alpha \in A_s(\alpha_{s+1})$.

Now according to the definition of $\mathcal{Y}_{s+1}$ we can write:

$$\begin{aligned} \mathcal{Y}_{s+1} &= \{y\,x_{s+2:n} \mid y \in X_{s+1}(\alpha_{s+1}; \Pi_{s+1})\} \\ &= \{y\,(\alpha_{s+1})_{s+1:s+1}\,x_{s+2:n} \mid y \in X_s(\alpha^-_{s+1}; \Pi^-_{s+1})\} \\ &= \bigcup_{\alpha\in A_s(\alpha_{s+1})} \{y\,(\alpha_{s+1})_{s+1:s+1}\,x_{s+2:n} \mid y \in X_s(\alpha; \Pi_s)\} \end{aligned}$$

It only remains to check that in the last union the set corresponding to $\alpha \in A_s(\alpha_{s+1})$ is exactly equal to $\mathcal{Y}_s(\alpha)$.

Part (b). Let $p$ be the start position of $\alpha_{s+1}$, i.e. $\alpha_{s+1} = ([p, s+1], \cdot)$. Consider labeling $y \in \mathcal{Y}_{s+1}$; we then must have $y = *\alpha_{s+1}*$. Let $\beta = ([i,j], \cdot) \in \Pi^\circ$ be a pattern with $y = *\beta*$, $j > s$. We will prove that $i \ge p$; this will imply the claim.

Suppose on the contrary that $i < p$. Denote $\gamma = \beta_{i:s+1}$, then $\gamma \in \Pi_{s+1}$ and $\gamma = +\alpha_{s+1}$. Therefore, $y_{1:s+1} \notin X_{s+1}(\alpha_{s+1}; \Pi_{s+1})$ (since $y_{1:s+1} = *\gamma$). However, this contradicts the assumption that $y \in \mathcal{Y}_{s+1} = \mathcal{Y}_{s+1}(\alpha_{s+1})$.

5. Computing marginals 

In this section we again consider the semiring $(R, \oplus, \otimes) = (\mathbb{R}, +, \times)$ from Example 1 where all costs $c_\alpha$ are strictly positive, and consider a probability distribution $p(x) = f(x)/Z$ over labelings $x \in D^{1:n}$.

For a pattern $\alpha$ we define

$$\Omega(\alpha) = \{x \in D^{1:n} \mid x = *\alpha*\} \tag{20}$$

$$Z(\alpha) = \sum_{x\in\Omega(\alpha)} f(x) \tag{21}$$

We also define the set of patterns

$$\Pi = \{\alpha \mid \exists \alpha*, *\alpha \in \Pi^\circ, \alpha \text{ is non-empty}\} \tag{22}$$

Note that $\Pi^\circ \subseteq \Pi$ and $|\Pi_s| = |I(\Gamma)|$ for indexes $s$ that are sufficiently far from the boundary. We will present an algorithm for computing $Z(\alpha)$ for all patterns $\alpha \in \Pi$ in time $O(n\sum_{\alpha\in I(\Gamma)} |\alpha|)$. Marginal probabilities of a pattern-based CRF can then be computed as $p(x_{i:j} = \alpha) = Z(\alpha)/Z$ for a pattern $\alpha = ([i,j], \cdot)$.

In the previous section we used graph $G[\Pi_s]$ for a set of patterns $\Pi_s$; here we will need an analogous but slightly different construction for patterns in $\Pi$. For patterns $\alpha, \beta$ we write $\alpha \subseteq \beta$ if $\beta = *\alpha*$. If we have $\beta = +\alpha+$ then we write $\alpha \Subset \beta$.

Now consider $\alpha \in \Pi$. We define $\Phi(\alpha)$ to be the set of patterns $\beta \in \Pi$ such that $\alpha \Subset \beta$ and there is no other pattern $\gamma \in \Pi$ with $\alpha \Subset \gamma \subseteq \beta$.

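Our algorithm is given below. In the first step it runs Algorithm 1 from left to right and from right to left; as a result, we get forward messages $\overrightarrow{W}_j(\alpha)$ and backward messages $\overleftarrow{W}_i(\alpha)$ for patterns $\alpha = ([i,j], \cdot)$ such that

$$\overrightarrow{W}_j(\alpha) = \sum_{x = *\alpha,\; x\in D^{1:j}} f(x), \qquad \overleftarrow{W}_i(\alpha) = \sum_{y = \alpha*,\; y\in D^{i:n}} f(y) \tag{23}$$

Algorithm 3 Computing values $Z(\alpha)$

1: run Algorithm 1 in both directions to get messages $\overrightarrow{W}_j(\alpha)$, $\overleftarrow{W}_i(\alpha)$. For each pattern $\alpha = ([i,j], \cdot) \in \Pi$ set

$$W(\alpha) := \frac{\overrightarrow{W}_j(\alpha)\,\overleftarrow{W}_i(\alpha)}{f(\alpha)} \tag{24a}$$

$$W^-(\alpha) := \frac{\overrightarrow{W}_{j-1}(\alpha_{i:j-1})\,\overleftarrow{W}_{i+1}(\alpha_{i+1:j})}{f(\alpha_{i+1:j-1})} \tag{24b}$$

2: for $\alpha \in \Pi$ (in the order of decreasing $|\alpha|$) set

$$Z(\alpha) := W(\alpha) + \sum_{\beta\in\Phi(\alpha)} \left[ Z(\beta) - W^-(\beta) \right] \tag{25}$$
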

Theorem 6. Algorithm 3 is correct. 
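When implementing Algorithm 3 it is useful to have an independent reference to test against. The sketch below is not Algorithm 3: it computes $Z(\alpha)/Z$ by exhaustive enumeration, directly from definitions (20) and (21), and is usable only for tiny $n$. It assumes position-independent, strictly positive costs in a dictionary `costs`; all names are illustrative.

```python
from itertools import product

def brute_force_marginals(n, alphabet, costs):
    """Return {(i, j, word): p(x_{i:j} = word)} with 1-based inclusive intervals."""
    def f(x):
        value = 1.0
        for w, c in costs.items():
            for i in range(len(x) - len(w) + 1):
                if x[i:i + len(w)] == w:
                    value *= c
        return value

    labelings = ["".join(t) for t in product(alphabet, repeat=n)]
    Z = sum(f(x) for x in labelings)
    marginals = {}
    for w in costs:
        for i in range(n - len(w) + 1):
            # Z(alpha): total cost of labelings that contain w at interval [i+1, i+|w|].
            Z_alpha = sum(f(x) for x in labelings if x[i:i + len(w)] == w)
            marginals[(i + 1, i + len(w), w)] = Z_alpha / Z
    return marginals
```

A simple sanity check is that the marginals of the single-letter patterns at any fixed position sum to 1 whenever every letter of $D$ is a word of $\Gamma$.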

We prove the theorem in section 5.1, but first let us discuss the algorithm's complexity. We claim that all values $f(\alpha)$ used by the algorithm can be computed in $O(n(P + S))$ time where $P$ and $S$ are respectively the number of distinct non-empty prefixes and suffixes of words in $\Gamma$. Indeed, we first compute these values for patterns in the set $\tilde{\Pi} = \{\alpha \mid \exists \alpha* \in \Pi^\circ\}$; by Lemma 1, this takes $O(nP)$ time. This covers values $f(\alpha)$ used in eq. (24a). As for the value in eq. (24b) for pattern $\alpha = ([i,j], \cdot) \in \Pi$, we can use the formula

$$f(\alpha_{i+1:j-1}) = \frac{f(\alpha)\,\tilde{c}_\alpha}{\overrightarrow{\phi}(\alpha)\,\overleftarrow{\phi}(\alpha)}$$

where $\tilde{c}_\alpha = c_\alpha$ if $\alpha \in \Pi^\circ$ and $\tilde{c}_\alpha = 1$ otherwise, and

$$\overrightarrow{\phi}(\alpha) = \prod_{\beta\in\Pi^\circ,\; \alpha = *\beta} c_\beta, \qquad \overleftarrow{\phi}(\alpha) = \prod_{\beta\in\Pi^\circ,\; \alpha = \beta*} c_\beta$$

The latter values can be computed in $O(n(P + S))$ time by applying Lemma 1 in the forward and backward directions. (In fact, they were already computed when running Algorithm 1.)

We showed that step 1 can be implemented in $O(n(P + S))$ time; let us analyze step 2. The following lemma implies that it performs $O(n\sum_{\alpha\in I(\Gamma)} |\alpha|)$ arithmetic operations; since $\sum_{\alpha\in I(\Gamma)} |\alpha| \ge \sum_{\alpha\in\Gamma} |\alpha| \ge \max\{P, S\}$, we then get that the overall complexity is $O(n\sum_{\alpha\in I(\Gamma)} |\alpha|)$.



Lemma 7. For each $\beta \in \Pi$ there exist at most $2|\beta|$ patterns $\alpha \in \Pi$ such that $\beta \in \Phi(\alpha)$.

Proof. Let $\Psi$ be the set of such patterns $\alpha$. Note, there holds $\beta = +\alpha+$. We need to show that $m = |\Psi| \le 2|\beta|$. Let us order patterns $\alpha = ([i,j], \cdot) \in \Psi$ lexicographically (first by $i$, then by $j$): $\Psi = \{\alpha_1, \ldots, \alpha_m\}$ with $\alpha_t = ([i_t, j_t], \cdot)$, and denote $\sigma_t = (i_t - k) + (j_t - k) \in [2, 2(\ell - k - 1)]$ where $[k, \ell]$ is the interval for $\beta$. We will prove by induction that $\sigma_t \ge t + 1$ for $t \in [1, m]$; this will imply that $m + 1 \le \sigma_m \le 2(\ell - k - 1) = 2(|\beta| - 2)$, as desired.

The base case is trivial. Suppose that it holds for $t - 1$; let us prove it for $t$. If $i_t = i_{t-1}$ then $j_t > j_{t-1}$ by the definition of the order on $\Psi$, so the claim holds. Suppose that $i_t > i_{t-1}$. If $j_t < j_{t-1}$ then $\alpha_t \Subset \alpha_{t-1} \Subset \beta$, contradicting the condition $\beta \in \Phi(\alpha_t)$. Thus, $j_t \ge j_{t-1}$, and so the claim of the induction step holds. □



Remark 1. An alternative method for computing marginals with complexity $O(n|\Gamma|L^2\ell_{\max}^2)$ was given in (Ye et al., 2009). They compute value $Z(\alpha)$ directly from the forward and backward messages $\overrightarrow{M}(\cdot)$ and $\overleftarrow{M}(\cdot)$ by summing over pairs of patterns (thus the square factor in the complexity). In contrast, we use a recursive rule that uses previously computed values of $Z(\cdot)$. We also use the existence of the "$\ominus$" operation. This allows us to achieve better complexity.

5.1. Proof of theorem 6 

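Consider labeling $x \in D^{1:n}$. We define $\Lambda(x) = \{\alpha \in \Pi^\circ \mid x = *\alpha*\}$ to be the set of patterns contained in $x$. For an interval $[i,j] \subseteq [1, n]$ we also define sets

$$\Lambda_{ij}(x) = \{\beta \in \Lambda(x) \mid \beta = +x_{i:j}+\} \tag{26a}$$

$$\Lambda^-_{ij}(x) = \{\beta \in \Lambda(x) \mid \beta = *x_{i:j}*\} \tag{26b}$$

and corresponding costs

$$f_{ij}(x) = \prod_{\beta\in\Lambda(x) - \Lambda_{ij}(x)} c_\beta \tag{27a}$$

$$f^-_{ij}(x) = \prod_{\beta\in\Lambda(x) - \Lambda^-_{ij}(x)} c_\beta \tag{27b}$$

It can be checked that quantities $W(\alpha)$, $W^-(\alpha)$ defined via (23) and (24) satisfy

$$W(\alpha) = \sum_{x\in\Omega(\alpha)} f_{ij}(x), \qquad W^-(\alpha) = \sum_{x\in\Omega(\alpha)} f^-_{ij}(x) \tag{28}$$

where $[i,j]$ is the interval for $\alpha$.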



Inference algorithms for pattern-based CRFs on sequence data 



Consider pattern $\alpha = ([i,j], \cdot) \in \Pi$. We will show that for any $x \in \Omega(\alpha)$ there holds

$$f(x) = f_{ij}(x) + \sum_{\beta = ([k,\ell],\cdot)\in\Phi(\alpha):\; x = *\beta*} \left[ f(x) - f^-_{k\ell}(x) \right] \tag{29}$$

This will be sufficient for establishing the algorithm's correctness: summing these equations over $x \in \Omega(\alpha)$ and using (21), (28) yields eq. (25).

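Lemma 8. The sum in (29) contains at most one pattern $\beta = ([k,\ell], \cdot) \in \Phi(\alpha)$ with $x = *\beta*$.

Proof. Consider two such patterns $\beta^1 = ([k^1, \ell^1], \cdot)$ and $\beta^2 = ([k^2, \ell^2], \cdot)$. Define $\hat{k} = \max\{k^1, k^2\}$, $\hat{\ell} = \min\{\ell^1, \ell^2\}$, $\hat{\beta} = x_{\hat{k}:\hat{\ell}}$; then $\alpha \Subset \hat{\beta} \subseteq \beta^t$ for $t \in \{1, 2\}$. Using the definition of set $\Pi$, it can be checked that $\hat{\beta} \in \Pi$. The fact that $\beta^t \in \Phi(\alpha)$ then implies that $\beta^t = \hat{\beta}$ for $t \in \{1, 2\}$, and so $\beta^1 = \beta^2$. □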

We now consider two possible cases.

Case 1: There are no patterns $\beta = ([k,\ell], \cdot) \in \Phi(\alpha)$ with $x = *\beta*$. This implies that $\Lambda_{ij}(x)$ is empty, and therefore $f(x) = f_{ij}(x)$. Eq. (29) thus holds.

Case 2: There exists a (unique) pattern $\beta = ([k,\ell], \cdot) \in \Phi(\alpha)$ with $x = *\beta*$. Eq. (29) then becomes equivalent to the condition $f_{ij}(x) = f^-_{k\ell}(x)$. We will prove this by showing that $\Lambda_{ij}(x) = \Lambda^-_{k\ell}(x)$.

The inclusion $\Lambda^-_{k\ell}(x) \subseteq \Lambda_{ij}(x)$ is obvious; let us show the other direction. Suppose that $\gamma = ([p,q], \cdot) \in \Lambda_{ij}(x)$. Define $\hat{p} = \max\{k, p\}$, $\hat{q} = \min\{q, \ell\}$, $\hat{\gamma} = x_{\hat{p}:\hat{q}}$. It can be checked that $\hat{\gamma} \in \Pi$. We also have $\alpha \Subset \hat{\gamma} \subseteq \beta$. Therefore, condition $\beta \in \Phi(\alpha)$ implies that $\hat{\gamma} = \beta$, and so $p \le k$, $q \ge \ell$, and $\gamma \in \Lambda^-_{k\ell}(x)$.

6. General case: $O(nP'|D|)$ algorithm

In this section and in the next one we consider the case of a general commutative semiring $(R, \oplus, \otimes)$ (without assuming the existence of an inverse operation for $\oplus$). This can be used for computing MAP in CRFs containing positive costs $c_\alpha$. The algorithm closely resembles the method in (Ye et al., 2009); it is based on the same idea and has the same complexity. Our primary goal in presenting this algorithm is to motivate the $O(nP\log(\ell_{\max}+1))$ algorithm for the MAP problem given in the next section.

First, we select $\Pi$ as the set of proper prefixes of patterns in $\Pi^\circ$:

$$\Pi = \{\alpha \mid \exists \alpha+ \in \Pi^\circ\} \tag{30}$$



For each $\alpha \in \Pi_s$ we will compute message

$$M_s(\alpha) = \bigoplus_{x\in X_s(\alpha;\Pi_s)} f(x) \tag{31}$$

In order to go from step $s - 1$ to $s$, we will use an extended set of patterns $\hat{\Pi}_s$:

$$\hat{\Pi}_s = \{\alpha \mid \alpha^- \in \Pi_{s-1}\} \cup \{\varepsilon_s\} = \{\alpha c \mid \alpha \in \Pi_{s-1}, c \in D^{s:s}\} \cup \{\varepsilon_s\} \tag{32}$$

It can be checked that

$$\Pi_s \subseteq \hat{\Pi}_s \quad \text{and} \quad \Pi^\circ_s \subseteq \hat{\Pi}_s \tag{33}$$

In step $s$ we compute values $M_s(\alpha)$ in eq. (31) for all $\alpha \in \hat{\Pi}_s$. Note, we now use the generalized definition of $X_s(\alpha; \Pi_s)$ (eq. (9)) since we may have $\alpha \notin \Pi_s$. After completing step $s$, messages $M_s(\alpha)$ for $\alpha \in \hat{\Pi}_s - \Pi_s$ can be discarded.

Our algorithm is given below. We have $|\hat{\Pi}_s| = P'|D| + 1$ for indexes $s$ that are sufficiently far from the boundary, and thus the algorithm's complexity is $\Theta(nP'|D|)$ (if Lemma 1 is used for computing values $\phi(\alpha)$).



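Algorithm 4 Computing $Z = \bigoplus_{x\in D^{1:n}} f(x)$

1: initialize messages: set $M_0(\varepsilon_0) := \mathbf{1}$
2: for each $s = 1, \ldots, n$ traverse nodes $\alpha \in \hat{\Pi}_s$ of tree $G[\hat{\Pi}_s]$ starting from the leaves and set

$$M_s(\alpha) := \left[ \phi(\alpha) \otimes M_{s-1}(\alpha^-) \right] \oplus \bigoplus_{(\alpha,\beta)\in E[\hat{\Pi}_s],\; \beta\notin\Pi_s} M_s(\beta) \tag{34}$$

If $\alpha = \varepsilon_s$ then use $M_{s-1}(\alpha^-) = \mathbf{0}$
3: return $Z := \bigoplus_{\alpha\in\Pi_n} M_n(\alpha)$

Theorem 9. Algorithm 4 is correct.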
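As an illustration, the sketch below instantiates Algorithm 4 for the semiring $(\overline{\mathbb{R}}, \min, +)$ of Example 2, so that it returns the minimum energy. It rests on assumptions made only for this example: costs are position-independent (a dictionary `costs` from words to real numbers), update (34) is read as summing over children of $\alpha$ in $G[\hat{\Pi}_s]$ that lie outside $\Pi_s$, and $\phi$ and the tree are computed naively, so the running time is worse than $\Theta(nP'|D|)$. Function names are illustrative.

```python
from itertools import product
import math

def map_energy(n, alphabet, costs):
    """Sketch of Algorithm 4 over (min, +) with position-independent costs."""
    proper = {w[:k] for w in costs for k in range(len(w))}   # proper prefixes, incl. ""

    def phi(w):  # cost of all words of Gamma that are suffixes of w (naive, not Lemma 1)
        return sum(costs.get(w[k:], 0.0) for k in range(len(w)))

    M_prev = {"": 0.0}                                       # M_0(eps_0) = one = 0
    for s in range(1, n + 1):
        pi_s = {w for w in proper if len(w) <= s}
        hat = {w + c for w in M_prev for c in alphabet} | {""}   # extended set (32)
        extra = {w: math.inf for w in hat}   # 'sum' over children outside Pi_s
        M = {}
        for w in sorted(hat, key=len, reverse=True):         # leaves first
            first = phi(w) + M_prev[w[:-1]] if w else math.inf
            M[w] = min(first, extra[w])                      # update (34)
            if w and w not in pi_s:
                # Parent in G[hat]: the longest proper suffix that belongs to hat.
                p = next(w[k:] for k in range(1, len(w) + 1) if w[k:] in hat)
                extra[p] = min(extra[p], M[w])
        M_prev = {w: M[w] for w in pi_s}                     # discard the rest
    return min(M_prev.values())

def brute_force_min(n, alphabet, costs):
    """Reference: minimum energy by enumerating all labelings (tiny inputs only)."""
    best = math.inf
    for letters in product(alphabet, repeat=n):
        x = "".join(letters)
        e = sum(c for w, c in costs.items()
                for i in range(n - len(w) + 1) if x[i:i + len(w)] == w)
        best = min(best, e)
    return best
```

Recovering a minimizing labeling, rather than only its energy, would additionally require storing back-pointers for each `min`, which is omitted here.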

Remark 2. As we already mentioned, Algorithm 4 resembles the algorithm in (Ye et al., 2009). The latter computes the same set of messages as we do but using the following recursion: for a pattern $\alpha \in \Pi_s - \{\varepsilon_s\}$ they set

$$M_s(\alpha) := \bigoplus_{\gamma \,\in\, T_{s-1}(\alpha^-) \,-\, \bigcup_{(\alpha,\beta)\in E[\Pi_s]} T_{s-1}(\beta^-)} \phi(\gamma a) \otimes M_{s-1}(\gamma) \tag{35}$$

where $a = \alpha_{s:s}$ is the last letter of $\alpha$ and $T_{s-1}(\beta) = \{\gamma \mid \gamma = *\beta, \gamma \in \Pi_{s-1}\}$ for $\beta \in \Pi_{s-1}$ is the set of patterns in the branch of $G[\Pi_{s-1}]$ rooted at $\beta$. It can be shown that updates (34) and (35) are equivalent: they need exactly the same number of additions (and the same number of multiplications, if $\phi(\gamma a)$ in eq. (35) is replaced with $\phi(\alpha)$ and moved before the sum).
6.1. Proof of Theorem 9

To prove the correctness, we need to show that eq. (34) holds for each $\alpha \in \hat\Pi_s$.

Lemma 10. For any $\alpha \in \hat\Pi_s$ there holds

$$\phi(\alpha) \otimes M_{s-1}(\alpha^-) = \bigoplus_{x \in X_s(\alpha;\hat\Pi_s)} f(x) \tag{36}$$

Proof. For $\alpha = \varepsilon_s$ the claim is trivial: we have $D^{s:s} \subseteq \hat\Pi_s$ (since $\varepsilon_{s-1} \in \Pi_{s-1}$), therefore $X_s(\alpha;\hat\Pi_s) = \varnothing$ and the sum in (36) is $\mathbb{0}$. We thus assume that $\alpha \in \hat\Pi_s - \{\varepsilon_s\}$. Using definition (33), it can be checked that the mapping $x \mapsto x^-$ is a bijection $X_s(\alpha;\hat\Pi_s) \to X_{s-1}(\alpha^-;\Pi_{s-1})$. Consider $x \in X_s(\alpha;\hat\Pi_s)$. We claim that if $x = {*}\gamma$ and $\gamma \in \Pi^\circ$ then $|\gamma| \le |\alpha|$. Indeed, we have $\gamma \in \hat\Pi_s$ (since $\Pi^\circ_s \subseteq \hat\Pi_s$), and so if $|\gamma| > |\alpha|$ then $x \notin X_s(\alpha;\hat\Pi_s)$ - a contradiction.

Using the claim, we conclude that $\phi(\alpha) \otimes f(x^-) = f(x)$. This implies the lemma. □

The fact $\Pi_s \subseteq \hat\Pi_s$ implies the following characterization of $X_s(\alpha;\Pi_s)$ for $\alpha \in \hat\Pi_s$:

$$X_s(\alpha;\Pi_s) = \left\{ x \in X_s(\alpha) \;\middle|\; x \ne {*}\beta \text{ for any } \beta \text{ in the subtree of } \alpha \text{ in } G[\hat\Pi_s] \text{ with } \beta \in \Pi_s,\ \beta \ne \alpha \right\}$$

where the subtree of $\alpha$ in $G[\hat\Pi_s]$ is defined as the set of descendants of $\alpha$ in $G[\hat\Pi_s]$ (including $\alpha$).

Now it becomes clear that $X_s(\alpha;\hat\Pi_s) \subseteq X_s(\alpha;\Pi_s)$ and $X_s(\alpha;\Pi_s) - X_s(\alpha;\hat\Pi_s)$ equals the set of partial labelings $x \in X_s(\alpha)$ such that

• $x$ ends with $\beta \in \hat\Pi_s - \Pi_s$, $(\alpha,\beta) \in E[\hat\Pi_s]$, and

• $x$ does not end with any pattern $\gamma \in \Pi_s$ from the subtree of $\beta$ in $G[\hat\Pi_s]$.

It is easy to check that, for a fixed $\beta$, the last set of partial labelings equals $X_s(\beta;\Pi_s)$, and such sets for different $\beta$'s are disjoint. We showed that $X_s(\alpha;\Pi_s)$ is a disjoint union of sets $X_s(\alpha;\hat\Pi_s)$ and $X_s(\beta;\Pi_s)$ for $(\alpha,\beta) \in E[\hat\Pi_s]$, $\beta \notin \Pi_s$. This fact together with Lemma 10 implies eq. (34).

7. General case: $O(nP \log P)$ algorithm

In the previous section we presented an algorithm for a general commutative semiring with complexity $O(nP'|D|)$. In some applications the size of the input alphabet can be very large (e.g. hundreds or thousands), so the technique may be very costly. Below we present a more complicated version with complexity $O(nP \log P)$. If $(R, \oplus, \otimes) = (\mathbb{R}, \min, +)$ then this can be reduced to $O(nP \log(\ell_{\max} + 1))$ using the algorithm for Range Minimum Queries by (Berkman & Vishkin, 1993). We assume that $D \subseteq \Gamma$.

We will use the same definitions of sets $\Pi_s$ and $\hat\Pi_s$ as in the previous section, and the same interpretation of messages $M_s(\alpha)$ given by eq. (31). We need to solve the following problem: given messages $M_{s-1}(\alpha)$ for $\alpha \in \Pi_{s-1}$, compute messages $M_s(\alpha)$ for $\alpha \in \Pi_s$.

Recall that in the previous section this was done by computing messages $M_s(\alpha)$ for patterns in the extended set $\hat\Pi_s$ of size $O(P'|D|)$. The idea of our modification is to compute these messages only for patterns in the set $\Sigma_s$, where

$$\Sigma_s^\circ \subseteq \Sigma_s \subseteq \hat\Pi_s, \qquad \Sigma_s^\circ = \Pi_s \cup \Pi^\circ_s, \qquad |\Sigma_s| \le 2|\Sigma_s^\circ|$$



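Note that $|\Sigma_s^\circ| \le P + 1$. Patterns in $\Sigma_s$ will be called special. To define them, we will use the following notation for a node $\alpha \in \hat\Pi_s$:

• $\Phi_s(\alpha)$ is the set of children of $\alpha$ in the tree $G[\hat\Pi_s]$.

• $T_s(\alpha)$ is the set of nodes in the subtree of $G[\hat\Pi_s]$ rooted at $\alpha$. We have $\alpha \in T_s(\alpha) \subseteq \hat\Pi_s$.

We now define set $\Sigma_s$ as follows: pattern $\alpha \in \hat\Pi_s$ is special if either (i) $\alpha \in \Sigma_s^\circ$, or (ii) $\alpha$ has at least two children $\beta_1, \beta_2 \in \Phi_s(\alpha)$ such that subtree $T_s(\beta_i)$ for $i \in \{1, 2\}$ contains a pattern from $\Sigma_s^\circ$, i.e. $T_s(\beta_i) \cap \Sigma_s^\circ \ne \varnothing$.

The set of remaining patterns $\hat\Pi_s - \Sigma_s$ will be split into two sets $\mathcal{A}_s$ and $\mathcal{B}_s$ as follows:

• $\mathcal{A}_s$ is the set of patterns $\alpha \in \hat\Pi_s - \Sigma_s^\circ$ such that subtree $T_s(\alpha)$ does not contain patterns from $\Sigma_s^\circ$.

• $\mathcal{B}_s$ is the set of patterns $\alpha \in \hat\Pi_s - \Sigma_s^\circ$ such that $\alpha$ has exactly one child $\beta$ in $G[\hat\Pi_s]$ for which $T_s(\beta) \cap \Sigma_s^\circ \ne \varnothing$.

Clearly, $\hat\Pi_s$ is a disjoint union of $\mathcal{A}_s$, $\mathcal{B}_s$ and $\Sigma_s$.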
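The three-way split above depends only on the tree shape and on which nodes are marked. A minimal sketch, assuming the tree is given as a hypothetical `children` dictionary and the marked set `base` plays the role of the patterns that must be kept:

```python
def classify_nodes(children, root, base):
    """Split the nodes of a rooted tree into (special, A, B).

    children: dict mapping a node to the list of its children.
    base:     set of marked nodes (always special).
    A node outside base is special if at least two of its child
    subtrees contain a marked node, goes to B if exactly one does,
    and goes to A if none does.
    """
    # Pre-order listing; reversing it visits children before parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    has_base = {}                      # node -> subtree contains a marked node
    special, a_set, b_set = set(), set(), set()
    for v in reversed(order):
        k = sum(1 for c in children.get(v, []) if has_base[c])
        has_base[v] = (v in base) or k > 0
        if v in base or k >= 2:
            special.add(v)
        elif k == 1:
            b_set.add(v)
        else:
            a_set.add(v)
    return special, a_set, b_set
```

On any input the three returned sets partition the nodes, and the number of special nodes stays below twice the number of marked nodes, in line with the bound stated for the special patterns.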

Consider a node $\alpha \in \mathcal{B}_s$. From the definition, $\alpha$ has exactly one link to a child in $G[\hat\Pi_s]$ that belongs to $\mathcal{B}_s \cup \Sigma_s$. If this child does not belong to $\Sigma_s$, then it belongs to $\mathcal{B}_s$ and the same argument can be repeated for it. By following such links we eventually get to a node in $\Sigma_s$; the first such node will be denoted as $\alpha^\downarrow$.

We will need two more definitions. For an index $t$ and patterns $\alpha, \beta$ ending at position $t$ with $\beta = {+}\alpha$ we denote

$$W_t(\alpha) = \bigoplus_{x \in X_t(\alpha)} f(x) \tag{37a}$$

$$V_t(\alpha,\beta) = \bigoplus_{x \in X_t(\alpha) - X_t(\beta)} f(x) \tag{37b}$$



Figure 2. Structure of the subtree of $G[\hat\Pi_s]$ rooted at a node $\alpha \in \Sigma_s$. White circles represent nodes in $\mathcal{A}_s$ (so all their children are also white), gray circles - nodes in $\mathcal{B}_s$, and black circles - nodes in $\Sigma_s$. Note, if $\beta \in \mathcal{B}_s$ is a child of $\alpha$ then $(\alpha, \beta^\downarrow) \in E[\Sigma_s]$.

We can now formulate the structure of the algorithm. 

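Algorithm 5 Computing $Z = \bigoplus_{x \in D^{1:n}} f(x)$

1: initialize messages: set $M_0(\varepsilon_0) := \mathbb{1}$
2: for each $s = 1, \ldots, n$ traverse nodes $\alpha \in \Sigma_s$ of tree $G[\Sigma_s]$ starting from the leaves and set

$$M_s(\alpha) := \phi(\alpha) \otimes [M_{s-1}(\alpha^-) \oplus A_s(\alpha) \oplus B_s(\alpha)] \ \oplus \bigoplus_{(\alpha,\beta) \in E[\Sigma_s],\ \beta \notin \Pi_s} M_s(\beta) \tag{38}$$

where

$$A_s(\alpha) = \bigoplus_{\beta \in \Phi_s(\alpha) \cap \mathcal{A}_s} W_{s-1}(\beta^-) \tag{39a}$$

$$B_s(\alpha) = \bigoplus_{\beta \in \Phi_s(\alpha) \cap \mathcal{B}_s} V_{s-1}(\beta^-, \beta^{\downarrow -}) \tag{39b}$$

If $\alpha = \varepsilon_s$ then use $M_{s-1}(\alpha^-) := \mathbb{0}$
3: return $Z := \bigoplus_{\alpha \in \Pi_n} M_n(\alpha)$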

To fully specify the algorithm, we still need to describe how we compute quantities $A_s(\alpha)$ and $B_s(\alpha)$ defined by eq. (39a) and (39b). This is addressed by the theorem below.

Theorem 11. (a) Algorithm 5 is correct.

(b) There holds $|\Sigma_s| \le 2|\Sigma_s^\circ| - 1 \le 2P + 1$.

(c) Let $h$ be the maximum depth of tree $G[\Pi_{s-1}]$. (Note, $h \le \ell_{\max} + 1$.) With an $O(P' \log h)$ preprocessing, values $V_{s-1}(\alpha, \beta)$ for any $\alpha, \beta \in \Pi_{s-1}$ with $\beta = {+}\alpha$ can be computed in $O(\log h)$ time.

(d) Values $A_s(\alpha)$ for all $\alpha \in \Sigma_s$ can be computed in $O(P \log P)$ time, or in $O(P)$ time when $(R, \oplus, \otimes) = (\mathbb{R}, \min, +)$.

Clearly, the theorem implies that the algorithm can be implemented in $O(nP \log P)$ time, or in $O(nP \log(\ell_{\max} + 1))$ time when $(R, \oplus, \otimes) = (\mathbb{R}, \min, +)$. To see this, observe that the sum in (39b) is effectively over a subset of children of $\alpha$ in the tree $G[\Sigma_s]$ (see Fig. 2), and this tree has size $O(P)$.

Before presenting the proof, let us make a few remarks. It can be easily checked that graph $G[\hat\Pi_s]$ has the following structure: the root $\varepsilon_s$ has $|D|$ children, and for each child $c \in D^{s:s} \subseteq \hat\Pi_s$ the subtree of $G[\hat\Pi_s]$ rooted at $c$ is isomorphic to the tree $G[\Pi_{s-1}]$. The isomorphism $T_s(c) \to \Pi_{s-1}$ is given by the mapping $\alpha \mapsto \alpha^-$.

For nodes $\alpha, \beta \in \Pi_{s-1}$ with $\beta = {+}\alpha$ we denote $\mathcal{P}_{s-1}(\alpha, \beta)$ to be the unique path from $\alpha$ to $\beta$ in $G[\Pi_{s-1}]$ (treated as a set of edges in $E[\Pi_{s-1}]$). Analogously, for nodes $\alpha, \beta \in \hat\Pi_s$ with $\beta = {+}\alpha$ let $\mathcal{P}_s(\alpha, \beta)$ be the unique path from $\alpha$ to $\beta$ in $G[\hat\Pi_s]$. It follows from the previous paragraph that if $\alpha \ne \varepsilon_s$ then path $\mathcal{P}_s(\alpha, \beta)$ is isomorphic to the path $\mathcal{P}_{s-1}(\alpha^-, \beta^-)$.

7.1. Proof of Theorem 11(a) 

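The statement is equivalent to the correctness of (38). Let us first divide the sum over $\beta$ in the last expression of (38) into two parts: nodes $\beta$ that belong to $\Phi_s(\alpha)$ and those that do not:

$$\bigoplus_{\substack{(\alpha,\beta) \in E[\Sigma_s],\ \beta \notin \Pi_s \\ \beta \in \Phi_s(\alpha)}} M_s(\beta) \ \oplus \bigoplus_{\substack{(\alpha,\beta) \in E[\Sigma_s],\ \beta \notin \Pi_s \\ \beta \notin \Phi_s(\alpha)}} M_s(\beta) \tag{40}$$

It is easy to see that the second sum can be written as $\bigoplus_{\beta \in \Phi_s(\alpha) \cap \mathcal{B}_s} \tilde M_s(\beta^\downarrow)$ (see Fig. 2), where

$$\tilde M_s(\gamma) = \begin{cases} M_s(\gamma) & \text{if } \gamma \notin \Pi_s \\ \mathbb{0} & \text{otherwise} \end{cases} \tag{41}$$

Using the distributive law, we can rewrite (38) as

$$\begin{aligned} M_s(\alpha) := {} & [\phi(\alpha) \otimes M_{s-1}(\alpha^-)] \\ & \oplus \bigoplus_{\beta \in \Phi_s(\alpha) \cap \mathcal{A}_s} [\phi(\alpha) \otimes W_{s-1}(\beta^-)] \\ & \oplus \bigoplus_{\beta \in \Phi_s(\alpha) \cap \mathcal{B}_s} \left( [\phi(\alpha) \otimes V_{s-1}(\beta^-, \beta^{\downarrow -})] \oplus \tilde M_s(\beta^\downarrow) \right) \\ & \oplus \bigoplus_{\beta \in \Phi_s(\alpha) \cap (\Sigma_s - \Pi_s)} M_s(\beta) \end{aligned} \tag{42}$$

We will use eq. (34) (for which the correctness is already proved) for the case when $\alpha \in \Sigma_s$. We can rewrite it as follows (we use the fact that $\mathcal{A}_s \cap \Pi_s = \mathcal{B}_s \cap \Pi_s = \varnothing$):

$$M_s(\alpha) := [\phi(\alpha) \otimes M_{s-1}(\alpha^-)] \oplus \bigoplus_{\beta \in \Phi_s(\alpha) \cap \mathcal{A}_s} M_s(\beta) \oplus \bigoplus_{\beta \in \Phi_s(\alpha) \cap \mathcal{B}_s} M_s(\beta) \oplus \bigoplus_{\beta \in \Phi_s(\alpha) \cap (\Sigma_s - \Pi_s)} M_s(\beta) \tag{43}$$

The first and the last terms of the sum in (43) are equal to those of the sum in (42). The lemma below implies that the same holds for the second and third terms (note that $\phi(\beta) = \phi(\alpha)$ for a child $\beta \notin \Pi^\circ$ of $\alpha$), thus proving the correctness of eq. (42).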

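Lemma 12. (a) For any $\alpha \in \mathcal{A}_s$ there holds

$$M_s(\alpha) = \phi(\alpha) \otimes W_{s-1}(\alpha^-) \tag{44}$$

(b) For any $\alpha \in \mathcal{B}_s$ there holds

$$M_s(\alpha) = [\phi(\alpha) \otimes V_{s-1}(\alpha^-, \alpha^{\downarrow -})] \oplus \tilde M_s(\alpha^\downarrow) \tag{45}$$

Proof. Part (a) Suppose that $\alpha \in \mathcal{A}_s$. This means that there are no patterns $\beta \in \Pi^\circ$ of the form $\beta = {+}\alpha$ (recall that $\Pi^\circ_s \subseteq \Sigma_s^\circ$). This in turn implies that $M_s(\alpha) = W_s(\alpha)$. This also implies that for any $x \in X_s(\alpha)$ there holds $f(x) = \phi(\alpha) \otimes f(x^-)$, and consequently $W_s(\alpha) = \phi(\alpha) \otimes W_{s-1}(\alpha^-)$.

Part (b) Suppose that $\alpha \in \mathcal{B}_s$. The definition of $\mathcal{B}_s$ implies that set $T_s(\alpha) - T_s(\alpha^\downarrow)$ does not contain nodes in $\Pi_s$ or in $\Pi^\circ$. Using this fact and the definition of sets $X_s(\cdot)$, $X_s(\cdot\,;\cdot)$ we get the following.

(i) If $\alpha^\downarrow \in \Pi_s$ then $X_s(\alpha;\Pi_s) = X_s(\alpha) - X_s(\alpha^\downarrow)$.

(ii) If $\alpha^\downarrow \notin \Pi_s$ then $X_s(\alpha;\Pi_s)$ is a disjoint union of $X_s(\alpha) - X_s(\alpha^\downarrow)$ and $X_s(\alpha^\downarrow;\Pi_s)$.

(iii) Partial labeling $x \in X_s(\alpha) - X_s(\alpha^\downarrow)$ cannot end with a pattern $\beta = {+}\alpha \in \Pi^\circ$. (If such $\beta$ exists then from the fact above we get $\beta \in T_s(\alpha^\downarrow)$, i.e. $\beta = {*}\alpha^\downarrow$ and $x \in X_s(\alpha^\downarrow)$ - a contradiction.) Therefore, for such $x$ we have $f(x) = \phi(\alpha) \otimes f(x^-)$. This implies that $\bigoplus_{x \in X_s(\alpha) - X_s(\alpha^\downarrow)} f(x) = \phi(\alpha) \otimes V_{s-1}(\alpha^-, \alpha^{\downarrow -})$.

Recall that $M_s(\alpha) = \bigoplus_{x \in X_s(\alpha;\Pi_s)} f(x)$. Using this fact and properties (i)-(iii), we conclude that (45) holds in each of the two cases ($\alpha^\downarrow \in \Pi_s$ and $\alpha^\downarrow \notin \Pi_s$). □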

7.2. Proof of Theorem 11(b) 

For a node $\alpha \in \hat\Pi_s$ we denote $T_s^\circ(\alpha) = T_s(\alpha) \cap \Sigma_s^\circ$.

Let us consider the process of a breadth-first search in the tree $G[\hat\Pi_s]$ starting from the root. At each step we will keep a certain set of nodes of the tree (which we call active nodes), and the transition to the next step is made by choosing one of the active nodes and replacing it with its children. The process stops when the set of active nodes becomes equal to the set of the leaves of the tree. To each step of the process we associate a partition of the set $\Sigma_s^\circ$. The partition is defined by the following rule: if $\alpha_1, \ldots, \alpha_k$ are active nodes, then the

k 

partition is E° = Q T °s{ a i) U {"}■ 

Let us denote the partition at step $t$ as $D_t$.

Partitions of the set $\Sigma_s^\circ$ form a poset with respect to the natural order defined as follows: $\{S_i\}_{i \in I} \le \{P_j\}_{j \in J}$ if for any $i \in I$ there is $j \in J$ such that $S_i \subseteq P_j$. It is easy to see that $D_0 \ge D_1 \ge D_2 \ge \ldots$ Moreover, if a chosen active node $\alpha$ at step $t$ is a special one and does not belong to $\Sigma_s^\circ$ then $D_t > D_{t+1}$. Indeed, there are at least two children of $\alpha$, denoted as $\alpha_1$ and $\alpha_2$, such that sets $T_s^\circ(\alpha_1)$ and $T_s^\circ(\alpha_2)$ are nonempty (by the definition of a special node). At step $t + 1$ these sets are separate components of partition $D_{t+1}$, but at step $t$ these sets still belong to the same component of $D_t$; thus, $D_t \ne D_{t+1}$.

We know that the length of a chain $D_{t_1} > D_{t_2} > \ldots$ cannot exceed $|\Sigma_s^\circ|$. We conclude that the number of special patterns that do not belong to $\Sigma_s^\circ$ is bounded by $|\Sigma_s^\circ| - 1$; this implies Theorem 11(b).

7.3. Proof of Theorem 11(c) 

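For brevity denote $t = s - 1$, and define a set of pairs

$$J = \{(\alpha, \beta) \mid \alpha, \beta \in \Pi_t,\ \beta \text{ is a strict descendant of } \alpha \text{ in } G[\Pi_t]\}$$

Note that $E[\Pi_t] \subseteq J$. In this section we describe an $O(P' \log h)$ preprocessing which will allow computing values $V_t(\alpha, \beta)$ for any $(\alpha, \beta) \in J$ in $O(\log h)$ time. The procedure will be based on the following observation; it follows trivially from the definition (37b).

Lemma 13. For any $(\alpha, \beta) \in J$ there holds

$$V_t(\alpha, \beta) = \bigoplus_{(\alpha', \beta') \in \mathcal{P}_t(\alpha, \beta)} V_t(\alpha', \beta') \tag{46}$$

During preprocessing we will compute values $V_t(\alpha, \beta)$ for pairs $(\alpha, \beta)$ in a certain set $\tilde J \subseteq J$ of size $O(P' \log h)$. In order to define $\tilde J$, we need some notation. For a pattern $\alpha \in \Pi_t$ let $h_\alpha = |\mathcal{P}_t(\varepsilon_t, \alpha)|$ be the height of $\alpha$ in $G[\Pi_t]$, and for an integer $d \in [0, h_\alpha]$ let $\alpha^{\uparrow d}$ be the node of tree $G[\Pi_t]$ obtained from $\alpha$ by taking $d$ steps towards the root. We now define

$$\tilde J = \{(\alpha^{\uparrow d}, \alpha) \mid \alpha \in \Pi_t,\ d \in [0, h_\alpha],\ d = 2^r \text{ for } r \in \mathbb{Z}_{\ge 0}\}$$

The preprocessing will consist of 3 steps.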
The preprocessing will consist of 3 steps. 
Step 1: compute values $W_t(\alpha)$ for all $\alpha \in \Pi_t$. We do it by traversing nodes $\alpha \in \Pi_t$ of tree $G[\Pi_t]$ (children before parents) and setting

$$W_t(\alpha) := M_t(\alpha) \oplus \bigoplus_{(\alpha, \beta) \in E[\Pi_t]} W_t(\beta) \tag{47}$$

This takes $O(P')$ time.

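Step 2: go through $\alpha \in \Pi_t$ and compute values $V_t(\alpha, \beta)$ for all $\beta \in \Phi_t(\alpha)$ where $\Phi_t(\alpha)$ is the set of children of $\alpha$ in $G[\Pi_t]$.

A naive way is to use the formula

$$V_t(\alpha, \beta) = M_t(\alpha) \oplus \bigoplus_{\gamma \in \Phi_t(\alpha) - \{\beta\}} W_t(\gamma)$$

for all $\beta \in \Phi_t(\alpha)$; however, this would take $O(k^2)$ time where $k = |\Phi_t(\alpha)|$. Instead, we do the following. Let us order patterns in $\Phi_t(\alpha)$ arbitrarily: $\Phi_t(\alpha) = \{\beta_1, \ldots, \beta_k\}$. For $i \in [1, k]$ denote

$$S_i^{\leftarrow} = \bigoplus_{j=1}^{i-1} W_t(\beta_j), \qquad S_i^{\rightarrow} = \bigoplus_{j=i+1}^{k} W_t(\beta_j)$$

We compute these values in $O(k)$ time by setting $S_1^{\leftarrow} := \mathbb{0}$, $S_k^{\rightarrow} := \mathbb{0}$ and then using recursions

$$S_{i+1}^{\leftarrow} := S_i^{\leftarrow} \oplus W_t(\beta_i), \qquad S_{i-1}^{\rightarrow} := S_i^{\rightarrow} \oplus W_t(\beta_i)$$

After that we set

$$V_t(\alpha, \beta_i) := M_t(\alpha) \oplus S_i^{\leftarrow} \oplus S_i^{\rightarrow}$$

For a given $\alpha \in \Pi_t$ the procedure takes $O(|\Phi_t(\alpha)|)$ time, and thus for all $\alpha \in \Pi_t$ it takes $O(P')$ time.

We now have values $V_t(\alpha, \beta)$ for all $(\alpha, \beta) \in E[\Pi_t]$.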
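The prefix/suffix trick of Step 2 can be sketched in isolation. The helper below (the name `all_but_one` is illustrative) aggregates, for every position, all values except the one at that position, using only the semiring addition and no inverse:

```python
def all_but_one(values, plus, zero):
    """For each i, combine values[j] over all j != i in O(k) total time.

    plus: associative and commutative operation (the semiring "sum").
    zero: identity element of plus.
    No subtraction or division is used, so the routine also works
    for operations without inverses such as min.
    """
    k = len(values)
    left = [zero] * k    # left[i]  = values[0] + ... + values[i-1]
    right = [zero] * k   # right[i] = values[i+1] + ... + values[k-1]
    for i in range(1, k):
        left[i] = plus(left[i - 1], values[i - 1])
    for i in range(k - 2, -1, -1):
        right[i] = plus(right[i + 1], values[i + 1])
    return [plus(left[i], right[i]) for i in range(k)]
```

Combining each returned entry with the message of the parent node then gives the values required for the tree edges.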

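Step 3: compute values $V_t(\alpha, \beta)$ for all $(\alpha, \beta) \in \tilde J$ using the recursion

$$V_t(\alpha^{\uparrow 2d}, \alpha) := V_t(\alpha^{\uparrow 2d}, \alpha^{\uparrow d}) \oplus V_t(\alpha^{\uparrow d}, \alpha)$$

for $d = 2^0, 2^1, \ldots, 2^r, \ldots$ and $(\alpha^{\uparrow 2d}, \alpha) \in \tilde J$.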

Evaluating queries for $(\alpha, \beta) \in J$. We showed how to compute values $V_t(\alpha, \beta)$ for $(\alpha, \beta) \in \tilde J$ in time $O(P' \log h)$; let us now describe how to compute value $V_t(\alpha, \beta)$ for a given $(\alpha, \beta) \in J$ in time $O(\log h)$. Let us construct a sequence $\beta_0, \beta_1, \ldots, \beta_k$ as follows: $\beta_0 = \beta$, and for $i \ge 0$ let $\beta_{i+1} = \beta_i^{\uparrow d}$ where $d$ is the maximum value such that $d = 2^r$ for $r \in \mathbb{Z}_{\ge 0}$ and $\beta_i^{\uparrow d}$ is still a descendant of $\alpha$ (or $\alpha$ itself). We stop when we get $\beta_k = \alpha$; clearly, this happens after $k = O(\log h)$ steps. We now set

$$V_t(\alpha, \beta) := \bigoplus_{i=0}^{k-1} V_t(\beta_{i+1}, \beta_i)$$
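Step 3 and the query procedure amount to a doubling ("binary lifting") table over the tree. Below is a minimal sketch under simplifying assumptions that are not from the paper: nodes are integers with `parent[v] < v`, node 0 is the root, and `edge_val[v]` holds the value attached to the edge from `parent[v]` to `v`.

```python
def build_lifting(parent, edge_val, plus):
    """Precompute jumps of length 2**r towards the root.

    up[r][v]  : ancestor of v reached after 2**r steps (-1 if none).
    agg[r][v] : plus-combination of the edge values on that path.
    Assumes parent[0] == -1 and parent[v] < v for v > 0.
    """
    n = len(parent)
    depth = [0] * n
    for v in range(1, n):
        depth[v] = depth[parent[v]] + 1
    up, agg = [list(parent)], [list(edge_val)]
    r = 0
    while (1 << (r + 1)) <= max(depth, default=0):
        nxt_up, nxt_agg = [-1] * n, [None] * n
        for v in range(n):
            if depth[v] >= (1 << (r + 1)):
                mid = up[r][v]                       # 2**r steps up
                nxt_up[v] = up[r][mid]               # another 2**r steps
                nxt_agg[v] = plus(agg[r][mid], agg[r][v])
        up.append(nxt_up)
        agg.append(nxt_agg)
        r += 1
    return depth, up, agg


def path_aggregate(a, b, depth, up, agg, plus, zero):
    """Combine edge values on the path from ancestor a down to b."""
    result, v = zero, b
    while v != a:
        gap = depth[v] - depth[a]
        r = gap.bit_length() - 1      # largest power of two not exceeding gap
        result = plus(agg[r][v], result)
        v = up[r][v]
    return result
```

Each query performs at most one jump per binary digit of the depth difference, which matches the logarithmic query time stated above; the table itself has one row per power of two up to the tree depth.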



7.4. Proof of Theorem 11(d) 

We will consider the general case of a commutative semiring and the case when $(R, \oplus, \otimes) = (\mathbb{R}, \min, +)$; the latter will be called the MAP case.

Let $d_\alpha$ for $\alpha \in \Pi_{s-1}$ be the number of children of $\alpha$ in $G[\Pi_{s-1}]$, $d_{\max} = \max_{\alpha \in \Pi_{s-1}} d_\alpha$, and $\hat d_\alpha$ for $\alpha \in \Sigma_s$ be the number of children of $\alpha$ in $G[\Sigma_s]$. We will present an $O(\sum_{\alpha \in \Pi_{s-1}} d_\alpha \log d_\alpha)$ preprocessing technique that will allow computing value $A_s(\alpha)$ for $\alpha \in \Sigma_s - \{\varepsilon_s\}$ in time $O((\hat d_\alpha + 1) \log d_{\alpha^-})$. The resulting complexity will be

$$O\Big(\sum_{\alpha \in \Pi_{s-1}} d_\alpha \log d_{\max}\Big) + \sum_{\alpha \in \Sigma_s} O\big((\hat d_\alpha + 1) \log d_{\max}\big) = O(P \log d_{\max})$$

In the MAP case (i.e. when $(R, \oplus, \otimes) = (\mathbb{R}, \min, +)$) we will present a faster solution. Namely, the preprocessing will take $O(\sum_{\alpha \in \Pi_{s-1}} d_\alpha) = O(P')$ time, and computing value $A_s(\alpha)$ for $\alpha \in \Sigma_s - \{\varepsilon_s\}$ will take $O(\hat d_\alpha + 1)$ time, leading to the overall complexity $O(P)$. For that we will use the Range Minimum Query (RMQ) problem which is defined as follows: given $N$ numbers $z_1, \ldots, z_N$, compute $\min_{k \in I} z_k$ for a given interval $I = [i, j] \subseteq [1, N]$. It is known (Berkman & Vishkin, 1993) that with an $O(N)$ preprocessing each query can be answered in $O(1)$ time per interval.
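For illustration only, here is a simpler and well-known substitute for the structure of Berkman & Vishkin: a sparse table with $O(N \log N)$ preprocessing and $O(1)$ queries. It is not the linear-time method cited above, and the function names are hypothetical; it only shows what an RMQ interface looks like (0-based, inclusive indices).

```python
def build_sparse_min(z):
    """table[r][i] = min(z[i : i + 2**r]); O(N log N) time and space."""
    n = len(z)
    table = [list(z)]
    r = 0
    while (2 << r) <= n:                     # while 2**(r+1) <= n
        half = 1 << r
        prev = table[r]
        table.append([min(prev[i], prev[i + half])
                      for i in range(n - 2 * half + 1)])
        r += 1
    return table


def range_min(table, i, j):
    """Minimum of z[i..j] (inclusive, i <= j) in O(1) time.

    Two blocks of length 2**r cover [i, j]; they may overlap, which is
    harmless because min is idempotent.
    """
    r = (j - i + 1).bit_length() - 1
    return min(table[r][i], table[r][j - (1 << r) + 1])
```

The overlap trick relies on idempotence of `min`, which is why the general semiring case needs the disjoint decomposition described later in this section.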

As in the previous section, we denote $t = s - 1$. We assume that we already have values $W_t(\alpha)$ for all $\alpha \in \Pi_t$ (they were computed in the previous section).

Preprocessing. Consider $\alpha \in \Pi_t$, and let us fix an ordering of children of $\alpha$: $\Phi_t(\alpha) = \{\beta_1, \ldots, \beta_d\}$ where $d = d_\alpha$. For an interval $I \subseteq [1, d]$ we denote

$$S_I(\alpha) = \bigoplus_{i \in I} W_t(\beta_i) \tag{48}$$

The goal of preprocessing is to build a data structure that will allow an efficient computation of $S_I(\alpha)$ for any given interval $I$.

In the MAP case we simply run the preprocessing of (Berkman & Vishkin, 1993) for the sequence $W_t(\beta_1), \ldots, W_t(\beta_d)$; this takes $O(d)$ time. Value $S_I(\alpha)$ for an interval $I$ can then be computed in $O(1)$ time.

In the general case we do the following. Define a set of intervals

$$\mathcal{J}_d = \{[i, j] \subseteq [1, d] \mid j - i = 2^r - 1,\ r \in \mathbb{Z}_{\ge 0}\}$$

Note that $|\mathcal{J}_d| = O(d \log d)$. We compute quantities $S_I(\alpha)$ for all $I \in \mathcal{J}_d$. This can be done in $O(d \log d)$ time by setting $S_{[i,i]}(\alpha) := W_t(\beta_i)$ for $i \in [1, d]$ and then using recursions

$$S_{[i,\, i+2\delta-1]}(\alpha) = S_{[i,\, i+\delta-1]}(\alpha) \oplus S_{[i+\delta,\, i+2\delta-1]}(\alpha)$$

for $\delta = 2^0, 2^1, \ldots$

Value $S_I(\alpha)$ for an interval $I$ can now be computed in $O(\log d)$ time. Indeed, we can represent $I$ as a disjoint union of $m = O(\log d)$ intervals from $\mathcal{J}_d$: $I = \bigcup_{i=1}^{m} I_i$ with $I_i \in \mathcal{J}_d$. We can then use the formula

$$S_I(\alpha) = \bigoplus_{i=1}^{m} S_{I_i}(\alpha) \tag{49}$$
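The table of power-of-two intervals and the disjoint decomposition can be sketched as follows. The sketch uses 0-based inclusive indices and illustrative function names; the test deliberately uses ordinary addition, which is not idempotent, so any accidental overlap between blocks would be detected.

```python
def build_dyadic_table(values, plus):
    """table[r][i] combines values[i : i + 2**r]; O(d log d) entries."""
    d = len(values)
    table = [list(values)]
    r = 0
    while (2 << r) <= d:                     # while 2**(r+1) <= d
        half = 1 << r
        prev = table[r]
        table.append([plus(prev[i], prev[i + half])
                      for i in range(d - 2 * half + 1)])
        r += 1
    return table


def range_aggregate(table, i, j, plus, zero):
    """Combine values[i..j] (inclusive) from O(log d) disjoint blocks.

    Greedily peels off the largest power-of-two block that fits at the
    left end of the remaining interval, so no element is used twice.
    """
    result = zero
    while i <= j:
        r = (j - i + 1).bit_length() - 1
        result = plus(result, table[r][i])
        i += 1 << r
    return result
```

An empty interval (`i > j`) returns the additive identity, which is convenient when a node has no children in the relevant range.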



Computing $A_s(\alpha)$ for $\alpha \in \Sigma_s - \{\varepsilon_s\}$. Denote $d = d_{\alpha^-}$. As discussed earlier, $\Phi_s(\alpha)$ (the set of children of $\alpha$ in $G[\hat\Pi_s]$) is isomorphic to $\Phi_{s-1}(\alpha^-)$. Let $\Phi_s(\alpha) = \{\beta_1, \ldots, \beta_d\}$ be the ordering of patterns in $\Phi_s(\alpha)$ such that $\beta_1^-, \ldots, \beta_d^-$ is the ordering of patterns in $\Phi_{s-1}(\alpha^-)$ chosen in the preprocessing step. We need to compute

$$A_s(\alpha) = \bigoplus_{i \in J} W_{s-1}(\beta_i^-), \qquad J = \{i \in [1, d] \mid \beta_i \in \mathcal{A}_s\}$$

Denote $\bar J = [1, d] - J$. We can represent $J$ as a disjoint union of at most $|\bar J| + 1$ intervals $I_1, \ldots, I_m$ where $I_i \subseteq [1, d]$. Clearly, $|\bar J| = \hat d_\alpha$ (see Fig. 2), and so $m \le \hat d_\alpha + 1$. We can write

$$A_s(\alpha) = \bigoplus_{i=1}^{m} S_{I_i}(\alpha^-) \tag{50}$$

As discussed above, each value $S_{I_i}(\alpha^-)$ can be computed in $O(1)$ time in the MAP case and in $O(\log d)$ in the general case. Thus, computing $A_s(\alpha)$ takes respectively $O(\hat d_\alpha + 1)$ and $O((\hat d_\alpha + 1) \log d_{\alpha^-})$ time, as desired.
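The counting argument (a set whose complement has $q$ elements splits into at most $q + 1$ intervals) can be checked with a short sketch. Names are illustrative; `range_query` stands for any interval-aggregation routine, and a plain loop is used as a fallback so that the example is self-contained.

```python
def complement_intervals(d, excluded):
    """Maximal runs of indices in range(d) avoiding `excluded`.

    Returns inclusive (lo, hi) pairs; there are at most
    len(excluded) + 1 of them. Assumes excluded is a subset of range(d).
    """
    runs, start = [], 0
    for e in sorted(set(excluded)):
        if e > start:
            runs.append((start, e - 1))
        start = e + 1
    if start < d:
        runs.append((start, d - 1))
    return runs


def aggregate_excluding(values, excluded, plus, zero, range_query=None):
    """Combine all values whose index is not in `excluded`."""
    if range_query is None:
        def range_query(lo, hi):             # naive fallback, O(hi - lo)
            acc = zero
            for v in values[lo:hi + 1]:
                acc = plus(acc, v)
            return acc
    acc = zero
    for lo, hi in complement_intervals(len(values), excluded):
        acc = plus(acc, range_query(lo, hi))
    return acc
```

Passing a constant-time or logarithmic-time `range_query` in place of the fallback reproduces the two running times stated above.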

8. MAP for non-positive costs 

In this section we assume that $(R, \oplus, \otimes) = (\mathbb{R}, \min, +)$ and $c_\alpha \le 0$ for all $\alpha \in \Pi^\circ$. (Komodakis & Paragios, 2009) gave an algorithm that makes $\Theta(nP)$ comparisons and $\Theta(nP)$ additions. We will present a modification that makes only $O(n|I(\Gamma)|)$ comparisons. The number of additions in general will still be $O(nP)$, but we will show that in certain scenarios it can be reduced using a Fast Fourier Transform (FFT).

We will assume that $\Gamma$ contains at least one word $\alpha$ with $|\alpha| = 1$ (it can always be added if needed).

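As usual, we first select a set of patterns $\Pi$ with $\Pi_0 = \{\varepsilon_0\}$; this step will be described later. For a pattern $\alpha \in \Pi$ let $\alpha^{\leftarrow}$ be the longest proper prefix of $\alpha$ that is in $\Pi$ ($\alpha = \alpha^{\leftarrow}{+}$, $\alpha^{\leftarrow} \in \Pi$). If $\Pi$ does not contain proper prefixes of $\alpha$ then $\alpha^{\leftarrow}$ is undefined.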

We can now present the algorithm. 

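Algorithm 6 Computing $Z = \min_{x \in D^{1:n}} f(x)$ (if $c_\alpha \le 0$)

1: initialize messages: set $M_0(\varepsilon_0) := 0$
2: for each $s = 1, \ldots, n$ traverse nodes $\alpha \in \Pi_s$ of forest $G[\Pi_s]$ starting from the leaves and set

$$M_s(\alpha) := \min\Big\{ M_p(\alpha^{\leftarrow}) + \psi(\alpha),\ \min_{(\alpha,\beta) \in E[\Pi_s]} M_s(\beta) \Big\} \tag{51}$$

where $p$ is the end position of $\alpha^{\leftarrow}$ ($\alpha^{\leftarrow} = ([\cdot, p], \cdot)$) and $\psi(\alpha) = f(\alpha) - f(\alpha^{\leftarrow})$. If $\alpha^{\leftarrow}$ is undefined then ignore the first expression in (51).
3: return $Z := \min_{\alpha \in \Pi_n} M_n(\alpha)$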



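Selecting $\Pi$. It remains to specify how to choose set $\Pi$. For patterns $\alpha, \beta, \gamma$ we define

$$\langle \alpha | \beta | \gamma \rangle = \{u = {+}\beta{+} \mid \alpha\beta\gamma = {*}u{*}\} \tag{52}$$

Theorem 14. Suppose that $\Pi_0 = \{\varepsilon_0\}$ and set $\Pi$ contains set

$$\bar\Pi = \left\{ \beta \;\middle|\; \exists \text{ labeling } x\alpha\beta\gamma y \in D^{1:n} \text{ s.t. (a) } \alpha\beta, \beta\gamma \in \Pi^\circ;\ \text{(b) } \langle x\alpha | \beta | \gamma y \rangle \cap \Pi^\circ = \varnothing \right\} \tag{53}$$

Then Alg. 6 returns the correct value of $Z = \min_x f(x)$.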

A proof of this theorem is given in Sec. 8.2. 

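A simple valid option is to set $\Pi = \{\alpha \mid \exists\, \alpha{*}, {*}\alpha \in \Pi^\circ\}$. Computing set $\bar\Pi$ is slightly more complicated, but can still be done in polynomial time for a given $\Pi^\circ$ (we omit this procedure). In order to analyze set $\bar\Pi$, let us define

$$I_\delta = \left\{ \beta \;\middle|\; \exists \text{ word } x\alpha\beta\gamma y \text{ with } |x| = |y| = \delta \text{ s.t. (a) } \alpha\beta, \beta\gamma \in \Gamma;\ \text{(b) } \langle x\alpha | \beta | \gamma y \rangle \cap \Gamma = \varnothing \right\}$$

where set $\langle \cdot | \cdot | \cdot \rangle$ for words is defined similarly to (52):

$$\langle \alpha | \beta | \gamma \rangle = \{\bar\alpha \beta \bar\gamma \mid \alpha = {*}\bar\alpha,\ \gamma = \bar\gamma{*} \text{ and } \bar\alpha, \bar\gamma \ne \varepsilon\} \tag{54}$$

As $\delta$ increases, set $I_\delta$ monotonically shrinks, and stops changing after $\delta \ge \ell_{\max} = \max_{\alpha \in \Gamma} |\alpha|$. We denote this limit set as $I_\infty$, so that $I_\infty \subseteq I_0 \subseteq I(\Gamma) \cup \{\varepsilon\}$. It can be seen that $\bar\Pi_s = \{([\cdot, s], \alpha) \mid \alpha \in I_\infty\}$ for all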

Complexity. Assume that we use $\Pi = \bar\Pi$. The algorithm performs two types of operations: comparisons (to compute minima) and arithmetic operations (to compute the first expression in (51)). The number of comparisons does not exceed the total number of edges in graphs $G[\Pi_s] = (\Pi_s, E[\Pi_s])$ for $s \in [1, n]$, which is smaller than the number of nodes (since graphs are forests). Thus, comparisons take $O(n|I_\infty|)$ time.

The time for arithmetic operations depends on how we compute quantities $f(\alpha)$. One possible approach is to use Lemma 1 for computing $f(\alpha)$ for all $\alpha \in \tilde\Pi$ where $\tilde\Pi$ is the set of prefixes of patterns in $\Pi^\circ$ (note that $\Pi \subseteq \tilde\Pi$). We have $|\tilde\Pi_s| \le P + 1$, and therefore the resulting overall complexity is $O(nP)$. Next, we describe an alternative approach based on a Fast Fourier Transform.

8.1. Computing $f(\alpha)$ using FFT

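For a word $\alpha$ and index $s \ge |\alpha|$ let $\alpha_s$ be the pattern $([s - |\alpha| + 1, s], \alpha)$. It is easy to see that

$$f(\alpha_s) = \sum_{\beta \in \Gamma} f_s(\alpha|\beta) \quad\text{where}\quad f_s(\alpha|\beta) = \sum_{t \,:\, \alpha_s = {*}\beta_t{*}} c_{\beta_t}$$

Lemma 15. For fixed words $\alpha, \beta$ quantities $f_s(\alpha|\beta)$ for $s \in [1, n]$ can be computed in $O(n \log n)$ time.

Proof. We assume that $|\alpha| \ge |\beta|$, otherwise the claim is trivial. Let $p = |\alpha| - |\beta| + 1$, and define sequences $a \in \mathbb{R}^{n - |\alpha| + 1}$, $b \in \mathbb{R}^{n - |\beta| + 1}$, $\lambda \in \{0, 1\}^p$ via

$$\begin{aligned} a_i &= f_{i + |\alpha|}(\alpha|\beta) && \forall i \in [0, n - |\alpha|] \\ b_j &= c_{([j,\, j + |\beta| - 1],\, \beta)} && \forall j \in [1, n - |\beta| + 1] \\ \lambda_k &= [\alpha_{k : k + |\beta| - 1} = \beta] && \forall k \in [1, p] \end{aligned}$$

where $[\cdot]$ is the Iverson bracket. It can be checked that

$$a_i = \sum_{k=1}^{p} b_{i+k} \lambda_k \qquad \forall i \in [0, n - |\alpha|]$$

Thus, $a$ is the convolution of $b$ and the reverse of $\lambda$. The convolution of such sequences can be computed in $O(n \log n)$ using a Fast Fourier Transform. □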
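The identity in the proof can be checked numerically. The sketch below uses a textbook recursive radix-2 FFT written in pure Python (an illustrative implementation, not tuned for speed) and 0-based indexing, so that `a[i] = sum_k b[i + k] * lam[k]`; all function names are assumptions made for this example.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def convolve(u, v):
    """Linear convolution of two real sequences via FFT."""
    size = 1
    while size < len(u) + len(v) - 1:
        size *= 2
    fu = fft([complex(x) for x in u] + [0j] * (size - len(u)))
    fv = fft([complex(x) for x in v] + [0j] * (size - len(v)))
    prod = fft([p * q for p, q in zip(fu, fv)], invert=True)
    return [(z / size).real for z in prod[:len(u) + len(v) - 1]]


def pattern_sums(b, lam):
    """a[i] = sum_k b[i + k] * lam[k] for i = 0 .. len(b) - len(lam).

    Computed as the convolution of b with the reversed indicator
    sequence lam, keeping only the fully overlapping positions.
    """
    p = len(lam)
    full = convolve(b, lam[::-1])
    return full[p - 1:len(b)]
```

Results are floating-point and should be compared with a tolerance; for short inner words a direct double loop is simpler and likely faster, which is consistent with the remark that the method pays off when the outer word is much longer than the inner one.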

In practice this method can be useful for computing $f_s(\alpha|\beta)$ when $|\alpha| \gg |\beta|$. A natural way to do this is to first choose a subset $S \subseteq \Gamma$ of words that are very often included as subwords in words from $I_\infty$ (usually, this means $S = \{\beta \mid \beta \in \Gamma,\ |\beta| \le \delta\}$ where $\delta$ is some threshold constant). Then computing $f_s(\alpha|\beta)$ for all $\alpha \in I_\infty$ and $\beta \in S$ will take $O(|I_\infty| \cdot |S|\, n \log n)$ time. For $\beta \in \Gamma - S$ quantities $f_s(\alpha|\beta)$ can be computed directly.

An example of using such an approach is the following theorem; it is proved by taking $\delta = 1$ and $S = D$.

Theorem 16. Suppose $\Gamma = D \cup A$ where $A$ consists of words of a fixed length $t$. Then Algorithm 6 can be implemented in $O(|I_\infty| \cdot |D|\, n \log n)$ time.



8.2. Proof of Theorem 14 (correctness) 

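First, let us prove that for all patterns $\alpha = ([\cdot, s], \cdot) \in \Pi$ there holds

$$M_s(\alpha) \ge \min_{x = {*}\alpha} f(x) \tag{55}$$

We use induction on the order used in the algorithm. The base case $\alpha = \varepsilon_0$ is ensured by the initialization step. Consider pattern $\alpha \in \Pi - \{\varepsilon_0\}$. Suppose that $\alpha^{\leftarrow}$ is defined; let $p$ be the end position of $\alpha^{\leftarrow}$. Consider partial labeling $y = {*}\alpha^{\leftarrow} \in D^{1:p}$, and let $x$ be its unique extension by $s - p$ letters ($x = y{+}$) such that $x = {*}\alpha$. Due to non-positivity of costs $c_\beta$ we have

$$f(x) = f(y) + \psi(\alpha) + \sum_{\substack{\beta = ([i,j], \cdot) \in \Pi^\circ,\ x = {*}\beta{*} \\ i \le s - |\alpha|,\ j > p}} c_\beta \ \le\ f(y) + \psi(\alpha)$$

Applying the induction hypothesis and the inequality above yields the desired claim:

$$\begin{aligned} M_s(\alpha) &= \min\Big\{ M_p(\alpha^{\leftarrow}) + \psi(\alpha),\ \min_{(\alpha,\beta) \in E[\Pi_s]} M_s(\beta) \Big\} \\ &\ge \min\Big\{ \min_{y = {*}\alpha^{\leftarrow}} f(y) + \psi(\alpha),\ \min_{(\alpha,\beta) \in E[\Pi_s]} \min_{x = {*}\beta} f(x) \Big\} \\ &\ge \min\Big\{ \min_{x = {*}\alpha} f(x),\ \min_{x = {*}\alpha} f(x) \Big\} = \min_{x = {*}\alpha} f(x) \end{aligned}$$

where in the equations above $x$ always denotes a partial labeling in $D^{1:s}$ and $y$ denotes a partial labeling in $D^{1:p}$. If $\alpha^{\leftarrow}$ is undefined then we can write similar inequalities but omitting expressions containing $\alpha^{\leftarrow}$.

By applying the claim to patterns $\alpha = ([\cdot, n], \cdot) \in \Pi$ we conclude that $Z \ge \min_{x \in D^{1:n}} f(x)$. The remainder of this section is devoted to the proof of the reverse inequality: $Z \le \min_{x \in D^{1:n}} f(x)$.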

Let us fix $x^* \in \arg\min_{x \in D^{1:n}} f(x)$. Let

$$A = \{\alpha \in \Pi^\circ \mid x^* = {*}\alpha{*}\}$$

be the set of patterns present in $x^*$, and

$$\hat A = \{\alpha \in A \mid \text{there is no } \bar\alpha \in A - \{\alpha\} \text{ with } \bar\alpha = {*}\alpha{*}\}$$

be the set of maximal patterns in $A$. We can assume w.l.o.g. that for each $k \in [1, n]$ there exists $\alpha = ([i, j], \cdot) \in \hat A$ with $k \in [i, j]$. Indeed, if it is not the case for some $k$ then we can modify $x^*$ by replacing the $k$-th letter of $x^*$ with some letter $c \in \Gamma \cap D$; this operation does not increase $f(x^*)$.

We define a total order $\preceq$ on patterns $\alpha = ([i, j], \cdot) \in \hat A$ as the lexicographical order with components $(i, j)$ (the first component is more significant).
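Since every pattern present in a fixed labeling is determined by its interval, the set of maximal patterns and its ordering can be illustrated on intervals alone. A minimal quadratic-time sketch, with an illustrative function name:

```python
def maximal_intervals(intervals):
    """Intervals not contained in any other interval of the input.

    intervals: iterable of inclusive (i, j) pairs.
    Returns the maximal ones sorted lexicographically by (i, j).
    """
    uniq = sorted(set(intervals))
    result = []
    for (i, j) in uniq:
        contained = any(a <= i and j <= b and (a, b) != (i, j)
                        for (a, b) in uniq)
        if not contained:
            result.append((i, j))
    return result
```

In the sorted output both the start and the end positions are strictly increasing, which is the property established for consecutive maximal patterns in the lemma that follows.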
Lemma 17. (a) For each pattern $\beta \in \hat A$ there holds $\beta \in \Pi$.

(b) Consider two consecutive patterns $\alpha_1 \prec \alpha_2$ in $\hat A$ with $\alpha_1 = ([i_1, j_1], \cdot)$, $\alpha_2 = ([i_2, j_2], \cdot)$, and let $\beta = x^*_{i_2 : j_1}$ be the pattern at which they intersect. (By the assumption above, $i_2 \le j_1 + 1$.) There holds $\beta \in \Pi$.

(c) For the patterns in (b), condition $M_{j_1}(\alpha_1) \le f(x^*_{1:j_1})$ implies $M_{j_2}(\alpha_2) \le f(x^*_{1:j_2})$.

Proof. Part (a) We can write labeling $x^*$ as $x^* = x\alpha\beta\gamma y$ where patterns $\alpha, \gamma$ are empty. Let us show that this choice satisfies conditions in (53). Condition (a) holds since $\alpha\beta = \beta\gamma = \beta \in \Pi^\circ$. Suppose that (b) does not hold, then there exists pattern $u = x^*_{k:\ell} \in \Pi^\circ$ with $u = {+}\beta{+}$. We have $u \in A$ and thus $\beta \notin \hat A$ - a contradiction. Therefore, $\beta \in \Pi$.

Part (b) The definitions of $\preceq$ and $\hat A$ imply that $i_1 < i_2$. (If $i_1 = i_2$ then we must have $j_1 < j_2$, but then $\alpha_1 \notin \hat A$ - a contradiction.) There also holds $j_1 < j_2$ (otherwise we would have $\alpha_2 \notin \hat A$ - a contradiction). This means that we can write labeling $x^*$ as $x^* = x\alpha\beta\gamma y$ where $\alpha\beta = \alpha_1$, $\beta\gamma = \alpha_2$ and the pattern intervals are as follows:

$$\begin{array}{ccccc} x & \alpha & \beta & \gamma & y \\ {[1, i_1 - 1]} & {[i_1, i_2 - 1]} & {[i_2, j_1]} & {[j_1 + 1, j_2]} & {[j_2 + 1, n]} \end{array}$$

Let us show that this choice satisfies conditions in (53). Condition (a) holds since $\alpha\beta = \alpha_1 \in \Pi^\circ$ and $\beta\gamma = \alpha_2 \in \Pi^\circ$. Suppose that (b) does not hold, then there exists pattern $u = x^*_{k:\ell} \in \Pi^\circ$ where $k < i_2$ and $\ell > j_1$. This means that $u \in A$.

From the definition of $\hat A$, there exists pattern $\hat u = x^*_{\hat k : \hat\ell} \in \hat A$ with $[k, \ell] \subseteq [\hat k, \hat\ell]$. To summarize, we have

$$\hat k \le k < i_2 \quad\text{and}\quad j_1 < \ell \le \hat\ell.$$

If $\hat k \le i_1$ then $[i_1, j_1] \subset [\hat k, \hat\ell]$ and thus $\alpha_1 \notin \hat A$ - a contradiction. Thus, there must hold $\hat k > i_1$. Similarly, we prove that $\hat\ell < j_2$. This implies that $\alpha_1 \prec \hat u \prec \alpha_2$, and therefore patterns $\alpha_1$ and $\alpha_2$ are not consecutive in $\hat A$ - a contradiction.

Part (c) $\beta$ is a proper prefix of $\alpha_2$ that belongs to $\Pi$. Therefore, $\alpha_2^{\leftarrow}$ is defined. We can define a sequence of patterns $\beta_0 = \beta, \beta_1, \ldots, \beta_m = \alpha_2$ with $\beta_k = ([i_2, s_k], \cdot)$, $s_0 = j_1 < s_1 < \ldots < s_m = j_2$ respectively such that $\beta_k \in \Pi$ and $\beta_{k-1} = \beta_k^{\leftarrow}$ for $k \in [1, m]$. Let us prove by induction on $k$ that

$$M_{s_k}(\beta_k) \le f(x^*_{1:s_k}) \quad\text{for } k \in [0, m].$$

Let us first check the base of the induction. Since $\beta, \alpha_1 \in \Pi_{j_1}$ and $\alpha_1 = {*}\beta$, by the definition of graph $G[\Pi_{j_1}]$ there is a (unique) path $\gamma_0, \gamma_1, \ldots, \gamma_r$ from $\gamma_0 = \beta$ to $\gamma_r = \alpha_1$ with $(\gamma_i, \gamma_{i+1}) \in E[\Pi_{j_1}]$. By eq. (51), $M_{j_1}(\gamma_i) \le M_{j_1}(\gamma_{i+1})$. Therefore, $M_{j_1}(\beta) \le M_{j_1}(\alpha_1) \le f(x^*_{1:j_1})$ where the last inequality holds by the assumption of part (c). This establishes the base case.

Now suppose that the claim holds for $k - 1 \in [0, m - 1]$; let us prove it for $k$. Denote $p = s_{k-1}$ and $s = s_k$. Note, $p$ is the index in step 2 of the algorithm during the processing of $(s, \beta_k)$. From eq. (51) and the induction hypothesis we get

$$M_s(\beta_k) \le M_p(\beta_{k-1}) + \psi(\beta_k) \le f(x^*_{1:p}) + \psi(\beta_k)$$

We will prove next that $f(x^*_{1:p}) + \psi(\beta_k) = f(x^*_{1:s})$, thus completing the induction step. It suffices to show that there is no pattern $\gamma = x^*_{i:j} \in \Pi^\circ$ with $i < i_2$ and $p < j \le s$.

Suppose on the contrary that such pattern exists. By the definition of $\hat A$ there exists pattern $\hat\gamma = ([\hat i, \hat j], \cdot) \in \hat A$ with $[i, j] \subseteq [\hat i, \hat j]$. We have $\hat i \le i < i_2$ and $\hat j \ge j > p \ge s_0 = j_1$. These facts imply that $\alpha_1 \prec \hat\gamma \prec \alpha_2$, contradicting the assumption that $\alpha_1$ and $\alpha_2$ are consecutive patterns in $\hat A$. □

Lemma 17 implies the main claim. 

Corollary 18. For each $\alpha = ([i, j], \cdot) \in \hat A$ there holds $M_j(\alpha) \le f(x^*_{1:j})$, and therefore $Z \le f(x^*)$.

Proof. We use induction on the total order $\preceq$. The lowest pattern $\alpha \in \hat A$ starts at position 1; for such pattern the claim follows by inspecting Algorithm 6. The induction step follows from Lemma 17(c). □